PREFACE TO THE REVISED EDITION


During the fourteen years that have elapsed since the first edition
of this book was published there has been a very considerable
extension of the use of statistical methods in business, in public 
administration, and in all the social sciences. The pressing require- 
ments of new tasks and new problems, together with increasing 
knowledge of statistical procedures on the part of administrative 
and research workers, have contributed to this extension. With 
this development, the older controversies over qualitative versus 
quantitative methods have largely been shelved. It is clear that
different problems call for different procedures; that the men who 
are grappling with research problems differ, as regards the methods 
of analysis they find congenial and fruitful; that induction and 
deduction are complementary phases of the processes that lead to 
scientific advance. The choice of research procedures does not 
necessitate the acceptance of one method and the rejection of another;
it calls for the finding of a blend of methods that is adapted
to a particular set of problems, and that is suited to the temperament
and abilities of the human agent that employs them. For
workers dealing with social and economic relations, statistical 
methods constitute an essential element of this blend. Knowledge 
of systematic procedures for handling quantitative data, and skill 
in their use, are necessary parts of the equipment of students of the 
social sciences and of public and private administrators who must 
utilize the facts of experience in the formulation of policies. 

Gains on this front have been paralleled by notable improvements
in statistical techniques. The post-war years have witnessed,
in this field, the initiation of such another period of intellectual
ferment and creative activity as that which, earlier, brought the 
great contributions of Karl Pearson and his associates. The older 
instruments of quantitative analysis have been refined and sharp- 
ened; methods of designing statistical experiments and formulat- 
ing and testing hypotheses have been improved; statistical infer- 
ence has been placed on a sounder foundation. There can be no 
doubt that these continuing improvements in the logic and in the 
technique of statistics will contribute in important ways to the 
advance of the social sciences and to the betterment of public and 
private administration. 

In preparing the present edition of Statistical Methods account
has been taken of the more important of the recent developments 
that have a bearing on the economic and business applications of 
statistics. In doing this I have sought to retain the main features 
of the first edition. A systematic development of the fundamentals 
of statistical method is needed by the beginning student. A working
compendium of procedures, with necessary aids to calculation
and reference tables, is required by the statistician engaged in
administration or research. The book is designed to meet these two
needs. 

The eighteen chapters of the present edition fall into two main
divisions. The first twelve chapters deal with the descriptive aspects
of statistics. Induction and sampling are purposely omitted
in this development of basic descriptive procedures. Problems of 
statistical inference, with certain more advanced aspects of statis- 
tical description, are discussed in the last six chapters, and in 
appendices A to E. This organization is, I think, well adapted to
the needs of instruction. Some teachers may, indeed, prefer to 
introduce at an earlier point the concepts of samples and parent 
populations and the treatment of sampling errors. If so, selected 
pages from the chapter on elementary probabilities and the normal 
curve (Chapter XIII) and from the introductory chapter on induction
(Chapter XIV) may follow Chapter V in the sequence of
study. 

In the chapters added to this edition I have sought to exemplify 
economic applications of the newer methods of analysis. These
methods offer rich and, as yet, largely unexplored possibilities to
research workers in the social sciences. In these sections I have
drawn heavily on the path-breaking work of R. A. Fisher. I am
indebted to Dr. Fisher and his publishers, Oliver and Boyd of
Edinburgh, for permission to include in this book the tabulations
that appear in certain of the Appendix Tables. These, with the
other tables included, are designed to make the present book a
reasonably complete working manual adapted to the needs of both
laboratory worker and student.

I must reaffirm my thanks to those who assisted me in various 
ways in the preparation of the first edition. I am indebted, in
addition, to Jacob M. Gould, Agnes B. Omundson, and William H.
Mills for valuable aid in the details of the revision. 


May, 1938. 


F. C. M.



PREFACE TO THE FIRST EDITION 


The last decade has witnessed a remarkable stimulation of
interest in quantitative methods in business and in the social 
sciences. The day when intuition was the chief basis of business 
judgment and unsupported hypothesis the mode in social studies
seems to have passed. Following the lead of workers in the older 
and traditionally more accurate physical sciences, social scientists 
and serious students of business are employing in greater measure 
than ever before a method of study based upon the observation
and analysis of facts. When these observations are quantitative 
in character appropriate methods are necessary for their organiza- 
tion and interpretation. This book deals with methods of com- 
bining and analyzing such observations, with primary emphasis 
upon materials drawn from the fields of economics and business. 

The justification for limiting the treatment to these particular 
fields is two-fold. Although general statistical methods are prac- 
tically universal in their application, special problems are en- 
countered in every field of study. This is particularly true in the 
realm of economics, which presents many distinctive difficulties
and many characteristic problems. Methods that are in some 
degree specialized to meet these particular requirements have 
been developed, and these methods call for treatment in a work 
that is restricted in scope. In the second place, methods can 
be most effectively explained in terms of particular subjects; ab- 
stract methodology is barren of interest to the average person. 
For these reasons the book has been written with reference to the 
specific needs of quantitative workers in economics and business. 

In the explanation of methods no attempt has been made to 
secure the brevity of exposition which may be desirable in a 
strictly mathematical work. The purpose throughout has been 
to write for the learner, not for the finished master, and the explanations
have been prepared with the needs of the former in mind.
I have felt free to omit certain detailed demonstrations of theorems
because this book is presented as an introduction to the subject, 
not as an exhaustive treatise. 

The methods of quantitative analysis that are in general use
today represent a long accretion, an accumulation of contribu- 
tions from workers in many fields. It would be vain to attempt to 
enumerate all the individuals who have contributed to the development
of the science of statistics. Individual references are given
in particular cases in the body of the text, but no list of such ac- 
knowledgments can serve as a complete record of the debt modern 
statisticians owe to their predecessors. 

For assistance in the preparation of the material contained in 
this book I am under many obligations. To Mr. H. E. Anderson
and Professor H. B. Killough I am indebted for certain of the data
employed in Chapters XI, XVI, and XVII. Professor Warren M.
Persons of the Harvard Committee on Economics Research has
courteously permitted me to make use of certain results of his work
on commodity price index numbers. The index of industrial activity
presented in Chapter IX and utilized in Chapter XI is a product
of the Statistical Division of the American Telephone and Telegraph
Company. I have employed it with the kind permission of
Mr. Seymour L. Andrew, Chief Statistician. Suggestions from
Professor A. H. Mowbray of the University of California have enabled
me to remove several obscurities that were present in an
earlier mimeographed edition. I am deeply grateful to Professors
Henry L. Moore, Theodore H. Brown, and Henry Schultz for their
help in critically reviewing portions of the manuscript. For assistance
at every stage of the work involved in the writing of this book
I am under deep obligation to Professor Donald H. Davenport.
His aid in the collection of material, in the preparation of charts,
and in the onerous task of seeing the book through the press has
been invaluable. To my wife, above all others, I am indebted
for a measure of constant and generous help that cannot be adequately
acknowledged here.

F. C. M.

November, 1924. 

CHAPTER I


STATISTICAL METHODS AND THE PROBLEMS 
OF ECONOMICS AND BUSINESS 

The distinction between economics and business rests upon 
viewpoint and approach, rather than subject matter. The
economist and the business man have different objectives,
but the substance of the science of economics and the materials
with which the art of business administration deals
are in large part the same. In this treatise we are concerned
with methods that may be employed in handling this
common subject matter. 

Classes of Business Activity 

The tasks that confront business men may, without undue 
straining, be placed in three classes. First, in logical se- 
quence, are the technical tasks that arise in the processes 
of production, involving problems of chemistry and physics, 
of engineering, of animal husbandry, of navigation. The 
basic technical knowledge called for in the solution of these
problems furnishes the foundation of our economic life. 
This is the domain of the hard-won arts of handling the 
raw materials and controlling the forces of nature. 

In the second class come activities that are connected 
with the internal organization and administration of indi- 
vidual business units. The technical functions of manipulat- 
ing organic and inorganic matter for the satisfaction of 
human wants are performed through administrative units, 
single farms, mines, factories, railroads, department stores. 
A whole new division of problems is faced by the business 
man in organizing these units, in coordinating the work of 
the different departments, in supervising the daily activities
of the individuals making up each organization. While these
are perhaps less fundamental than the technical problems of
production, they are, for the average business man, more
pressing and more difficult. Scientific method has made
less progress in solving these latter problems. There is not 
the organized body of knowledge which is found in the 
former field, nor are there the same trained experts to whom 
the tasks may be delegated. 

The two types of economic activity named above include 
tasks that are in a sense self-centered and controllable. The 
manufacturer of steel has his technical problems of smelting
and refining, his particular administrative duties. The
farmer or mine-owner faces the same types of problems, in
forms peculiar to his own situation. In the performance of 
tasks in these fields each man is dealing with problems all 
the elements of which are under more or less perfect control. 
Difficulties arise, but these are ordinarily difficulties inherent
in the given task, not difficulties arising from a sudden change
in the constituent elements of the problem, or the sudden 
interjection of a new factor. In this respect the third cate- 
gory of tasks to be performed by the business man differs 
materially from the first two. For this class is composed of 
problems the elements of which are subject only in part
to control by the individuals directly concerned. 

This third division includes buying and selling, and all 
the attendant activities that are carried on in terms of 
prices. As economic life is at present organized these func- 
tions are, to the business man, the most important ones he 
performs. The technical tasks of production and of internal 
organization and administration are but means to an end.
For the business man the goal of economic activity is the 
disposal of his product at a profit. The tasks preliminary to 
this final sale are of necessity subordinated to it, and so
performed that the final aim may be achieved. The point 
of emphasis here is that the business man, in buying and 
selling, faces problems containing elements which he cannot 
control. In securing his raw material, in bringing together
the other agents needed in production, and in the final dis- 
posal of his product, the business man deals with markets — 
commodity markets, labor markets, money markets — and 
finds himself acting in relation to a system of prices quite 
beyond his control in its major movements. The other 
less fundamental phases of his activity are subject to a high 
degree of control, but when the business man comes to the 
final and most important act, the profitable sale of his prod- 
uct, his power of control dwindles. The motivating force in 
business activity is the hope of pecuniary profits, pecuniary 
profits depend upon successful buying and selling, successful 
buying and selling depend upon favorable conditions in an 
uncontrollable world of prices — here is the argument that 
states the major problem of business. And these are the 
facts which make the price system the dominating and 
all-important factor in modern business life. 

The modern entrepreneur lives in an environment of
prices. The term “environment” is not an inapt figure;
this world of prices in which the business man functions 
constitutes a coherent, consistent, well-articulated system 
of interdependent parts, a system which encompasses all 
the business activities of the entrepreneur. Since the system 
is beyond the control of the individual he must adapt him- 
self to it, and must base his activities upon as complete an 
understanding of the system as he may obtain. Without 
this understanding the major problems of business are in- 
capable of solution. 

Quantitative Character of Economic and Business Problems

Problems falling in the first of the classes outlined above 
have long been recognized as essentially quantitative in 
character. Their solution calls for the application of the 
methods of precision which have been developed in the 
physical sciences. It is no less true that the strictly economic
and business problems falling in the other classes
require the employment of quantitative methods. Quali- 
tative considerations enter, of course, in the solution of such 
problems, helping to determine the questions to be asked 
and the methods to be employed. But facts, measured,
weighed and compared with other facts, constitute the basis
of business judgments and the foundation of economic reasoning.
Statistical methods provide means of organizing
and appraising these facts. 

Of the three classes of problems distinguished in the preceding
section two come within the scope of the present discussion.
Though the methods of statistics are in part applicable
to the solution of technical problems of production,
it is not the purpose of the present work to develop this
subject. For the solution of problems in the two other
fields, those connected with the internal organization and
administration of business units and with the processes of
buying and selling that bring the business man into contact
with the price system, methods of statistical analysis are
peculiarly appropriate.

Statistical Methods and Problems of Internal Administration

The typical business man, in the administration of his
organization, is called upon to deal with masses of measurements.
He is dealing with tons of coal, cubic feet of gas,
or kilowatt hours of energy consumed; with tons of pig iron
or pairs of shoes produced; with machine hours and man
hours; with wages, costs of production and selling prices
expressed in dollars and cents. With the increasing size of 
the business unit the data with which the administrator 
must deal become increasingly complicated and numerous, 
and it becomes increasingly difficult to determine their true
significance. Under intuitive or rule-of-thumb methods of 
administration it is impossible effectively to analyze large 
masses of data and to control business units above the 
average in size. It has been abundantly demonstrated that
the law of decreasing returns comes into play in business 
largely because of administrative difficulties. 

Whenever one deals with masses of data the problem is
one of condensation and analysis — condensation and sim- 
plification in order that it may be possible for limited human 
faculties to handle the data, analysis (and comparison) in 
order that the elements of the problem may be distinguished 
and their significance appreciated. Statistical methods have 
been developed to facilitate the condensation and analysis 
of masses of quantitative data. 

As a typical example of such a problem may be mentioned 
the allocation of costs, an operation which has been called 
cost accounting. The proper analysis of all the factors
which enter into this problem is only possible through the 
use of statistical methods. Accounting methods, restricted 
to the treatment of pecuniary units, are inadequate for the
complete analysis of the items of expense. The analysis of
sales records, again, calls for the condensation of masses of 
data, their representation in simple, understandable form, 
and their interpretation in relation to other business meas- 
urements. The analysis of markets and the study of purchas- 
ing records and commodities require the use of quantitative 
methods not restricted in their application to any one class 
of measurements. At every hand in internal administration 
statistical methods may be used to supplement accounting 
methods, to extend the knowledge of the executive, and to 
make more effective the control of business operations. 

Statistical Methods and External Problems

New problems are encountered when the business man
goes into the market to buy or sell. Continually before him 
are the phenomena of business cycles, and if he is to adapt 
his producing and marketing policies to the swings of the 
cycle he must undertake the analysis of these phenomena, 
employing tools appropriate to the task. Again, the price
system, the movements of which are of such fundamental
interest to the business man, requires analysis through the
use of quantitative methods. So complex and numerous are
the data to be dealt with here that simplification is imperative.
Apart somewhat from the immediate interests of the
business man, but of dominant importance to the economist,
are all the problems connected with the economic process of
distribution, the allocation of income and wealth among
the agents of production. These, as well as that other great
economic problem concerned with the question of value or
price determination, are quantitative problems, to be solved
through the use of quantitative methods of research. 

Statistical Procedures in Research

What are these methods, and wherein does research employing
such methods differ from other types of research?
Scientific inquiry, whatever its particular method may be,
proceeds through careful observation, logical inference and
accurate verification. Quantitative methods differ from
others only in that observation, inference, and verification
are based upon measurement. Until measurement is possible
in a science it is unavoidable that its observations and findings
should lack precision, no matter how brilliant the flashes
of intuition nor how painstaking the labors of its students
may be. The employment of methods of measurement,
making possible the analysis of the factors involved in terms
of precise units, gives to a science some of the advantages
that sharp-edged tools have over blunt and unreliable instruments.
Mathematics and its offspring, statistics and accounting,
are the powerful instruments which the modern
economist has at his disposal, and of which business, through
the development of research agencies and methods, is making
constantly greater use. 

The tools of the statistician are merely certain mathematical
methods, developed for particular types of research.
These types of research were not economic in the original 
development of statistical methods, but social, political, and
anthropometric, with one line of development (that relating 
to the theory of probabilities) extending back through the 
field of logic to the gaming table. Yet these tools, developed 
for work in restricted spheres, have been found to possess 
much wider applicability, and economics has been one of 
the newer fields in which the application of these methods, 
with appropriate alterations and additions, has had fruitful 
results. The economist has found his hand strengthened 
and the precision of his work materially increased by the 
new tools. And business, together with the more abstruse 
science of economics, has profited.

Reference has been made to the possibility of condensa- 
tion and simplification through the use of statistical pro- 
cedures. Such simplification is of cardinal importance in 
economics and in the other social sciences today. These 
sciences, to be realistic, must be scrupulously faithful to 
fact, yet the masses of facts relating to current social proc- 
esses are, in their magnitude, almost a menace to effective 
analysis. “Already,” writes a reviewer in the Journal of the
Royal Statistical Society, “economic analysis taxes language
to its utmost, and it is a question how much longer mere 
verbal exposition will be able to control the swelling floods 
of observable data.” Though one may feel that these floods 
of data fail to provide many of the essential facts about 
social processes today, there is point to the reviewer’s com- 
plaint. In the light of this danger systematic procedures 
in the organization and analysis of data have an importance 
today that they did not have at an earlier time. Statistical 
methods constitute such procedures. By their use we may 
seek to channel and appraise the floods of data, relating to 
business operations and other social processes, that the 
fact-gathering agencies of business and government currently 
release upon us. 

GRAPHIC PRESENTATION

The explanation of methods of condensing, analyzing, and 
interpreting the facts of business and economics must start
with the discussion of some fundamental considerations
which are mathematical rather than statistical in character.
In doing so it is deemed advisable, even at the risk of treading
quite familiar ground, to explain certain mathematical
conceptions to which constant reference will be made in 
later chapters. 

Statistical analysis is concerned primarily with data based 
upon measurement, expressed either in pecuniary or physical 
units. The methods of coordinate geometry, developed first
by the philosopher Descartes, greatly facilitate the manipulation
and interpretation of such data. A summary of the
basic principles of coordinate geometry will not be out of 
place. 


Rectangular Coordinates

If two straight lines intersecting each other at right
angles are drawn in a plane, it is possible to describe the
location of any point in that plane with reference to the
point of intersection of the two lines. We will call one of
the lines (a vertical line) Y'Y, the other line (horizontal) X'X,
and the point of intersection (or origin) O (cf. Fig. 1). If P
be any point in the plane, we may draw the line PM, parallel
to Y'Y and intersecting X'X at M, and the line PN,
parallel to X'X and intersecting Y'Y at N. If we set OM
equal to g units and ON equal to h units, g and h constitute
the coordinates of P, describing its location with reference
to the origin O. Thus, in Fig. 1, g equals 6 and h equals 5. 

The distance g along the x-axis is termed the abscissa of
the point P, while the distance h along the y-axis is termed
the ordinate of the point P. (It is a rule of notation always 
to give the abscissa first, followed by the ordinate.) The 
coordinates of any other point in the same plane may be
determined in the same way. Conversely, any two real
numbers determine a point in the plane, if one be taken as
the abscissa and the other as the ordinate.

Fig. 1. Location of a Point with Reference to Rectangular Coordinates

A point may lie either to the right or left or above or 
below the origin, O. It is conventional to designate as
positive abscissas laid off to the right of the origin, and as
negative abscissas laid off to the left of the origin, while
ordinates are positive when laid off above the origin and
negative when laid off below the origin. In general, the
values to be dealt with in economic statistics lie in the 
upper right-hand quadrant, where both abscissa and ordinate 
are positive. 
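
The sign conventions just described can be restated in a few lines of Python. This is only an illustrative sketch: the function name `quadrant` and the labels it returns are our own assumptions, not notation used in the text.

```python
def quadrant(x, y):
    """Classify a point by the signs of its abscissa (x) and ordinate (y).

    Hypothetical helper for illustration; the labels are our own.
    """
    if x == 0 or y == 0:
        return "on an axis"
    if x > 0:
        # Positive abscissas lie to the right of the origin.
        return "upper right" if y > 0 else "lower right"
    # Negative abscissas lie to the left of the origin.
    return "upper left" if y > 0 else "lower left"


# The point P of Fig. 1 has abscissa 6 and ordinate 5.
print(quadrant(6, 5))
```

Most series met with in economic statistics would fall under the "upper right" label, since both values are ordinarily positive.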

This conception of coordinates is fundamental in mathe- 
matics and of basic importance in statistical work. A very 
simple example will illustrate the utility of this device in 
representing business data. The figures presented in the 
following table may be employed. 

Table 1

Production of Passenger Automobiles in the United States, by
Months, During the Year 1937

Month          Number of passenger
               cars manufactured

January             309,637
February            296,636
March               403,879
April               439,980
May                 425,432
June                411,394
July                360,403
August              311,450
September           118,671
October             298,662
November            295,328
December            244,385


These data may be represented graphically on the coordinate
system, months being laid off along the x-axis and
number of automobiles along the y-axis, as in the accompanying
diagram (Fig. 2). In plotting the abscissas, December,
1936, is considered as located at the point of origin.
The x-value of the entry for January, 1937, is thus 1, of
the February figure 2, etc. The coordinates of the point
representing the number of cars produced in January, 1937,
are 1 and 309,637; for February the values are 2 and 296,636.
The coordinates for December are 12 and 244,385. The
movement of automobile production during the year may
be more easily followed if the points are connected by a
series of straight lines, as is done in the figure.

[Fig. 2. Vertical axis: thousands of cars; horizontal axis: months of the year 1937]
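
The pairing of months with production figures may be sketched in Python as follows. The figures are transcribed from Table 1 and should be checked against it; the variable names `production` and `points` are our own, introduced only for illustration.

```python
# Monthly production of passenger cars in 1937, as transcribed from Table 1.
production = [
    309_637, 296_636, 403_879, 439_980, 425_432, 411_394,
    360_403, 311_450, 118_671, 298_662, 295_328, 244_385,
]

# December, 1936, is taken as the origin, so January, 1937, has abscissa 1.
points = [(month, cars) for month, cars in enumerate(production, start=1)]

print(points[0])    # coordinates for January
print(points[-1])   # coordinates for December
```

Each pair gives the abscissa first and the ordinate second, following the rule of notation stated earlier. Joining successive pairs by straight lines would give the broken line of the chart.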


Independent and Dependent Variables

In the location of any point by means of coordinates it
has been pointed out that two values are involved; every 
point ties together and expresses a relation between two 
factors. In the above case these are months and number of 
passenger automobiles produced. With the passage of time 
the volume of automobile production changes, and the 
broken line shows the direction and magnitude of these 
changes. Both time and number of cars produced are variables,
that is, they are quantities not of constant value but
characterized by variations in value in the given discussion.
Thus in Fig. 1 the abscissa has a fixed value of 6, while the 
ordinate has a fixed value of 5, but in Fig. 2 both abscissa 
and ordinate have varying values, the one varying from 1 
to 12, the other from 118,671 to 439,980. The symbols x 
and y are, by convention, used to designate such variable 
quantities as these, the former in all cases representing the 
variable plotted along the horizontal axis, the latter representing
the variable plotted along the vertical axis.¹

In Fig. 2, which depicts the changes taking place in 
automobile production with the passage of time, it will be 
noted that the latter variable changes by an arbitrary unit, 
one month. Having made an independent change in the 
time factor we then determine the change in production taking
place during the period thus arbitrarily chopped out. The 
variable which increases or decreases by increments arbitrarily
determined is called the independent variable, and is
generally plotted on the x-axis. The other variable is termed
the dependent variable, and is plotted on the y-axis. This
dependence may be real, in the sense that the values of the
second variable are definitely determined by the values of
the independent variable, or it may be purely a conventional
dependence of the type described. Time, it should
be noted, is always plotted as independent, when it consti- 
tutes one of the variables. 
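
The procedure of advancing the independent variable by one month and then noting the change in the dependent variable may be sketched as below. The figures are again transcribed from Table 1, and the name `changes` is our own; this is an illustration of the idea, not a method prescribed in the text.

```python
# Monthly production of passenger cars in 1937, as transcribed from Table 1.
production = [
    309_637, 296_636, 403_879, 439_980, 425_432, 411_394,
    360_403, 311_450, 118_671, 298_662, 295_328, 244_385,
]

# For each step of one month in the independent variable (time),
# record the corresponding change in the dependent variable.
changes = [later - earlier for earlier, later in zip(production, production[1:])]

print(changes[0])   # change from January to February
```

Twelve monthly figures yield eleven changes, one for each interval between successive months.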

Functional Relationship 

When the relationship between two variables is one of 
complete dependence, so that the value of y is uniquely 
determined by a given value of x, y is said to be a function
of x. The general expression for such a relationship is
y = f(x). Thus the speed at which a body is falling at a
given moment is a function of the time it has been falling,
the pressure of a given volume of gas is a function of its
temperature, the increase of a given principal sum of money
at a fixed rate of interest is a function of time. If the values
of the independent variable be laid off on the x-axis of a
rectilinear chart and the corresponding values of the function
(i.e., the dependent variable) be laid off on the y-axis,
a graphic representation of the function will be secured, in
the form of a curve.* This concept of functional relationship
is a very important one in statistical work. Some of the
simpler functions may be briefly discussed.

¹ It should be noted that letters at the end of the alphabet are used as
symbols for variables, while letters at the beginning of the alphabet are used as
symbols for constants, i.e., quantities the values of which do not change in the
given discussion.
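
One of the examples just named, the growth of a principal sum at a fixed rate of interest, may be written as a function of time. The sketch below assumes annual compounding, a principal of 100 and a rate of 5 per cent; none of these particulars is given in the text, and the name `amount` is our own.

```python
def amount(principal, rate, years):
    """Value of a sum after a number of years at a fixed rate of interest.

    Annual compounding is an assumption made for this sketch.
    """
    return principal * (1 + rate) ** years


# Tabulate y = f(x) for a few values of the independent variable (time).
for years in range(0, 4):
    print(years, round(amount(100.0, 0.05, years), 2))
```

Each value of the independent variable determines exactly one value of the function, which is what is meant by complete dependence.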

THE STRAIGHT LINE 

If two variables are so related that their values are always 
the same, their relationship is obviously of the form y = x. 
As a very simple example, the relation between the age of a 
tree and the number of rings in its trunk may be cited. 
A tree 6 years old will have 6 rings, for 20 years there will
be 20 rings, and so on. This relationship may be represented 
on a coordinate chart, several sample values of x and y 
being taken. When these points are plotted and a line drawn 
through them, we secure a straight line passing through the 
origin and (assuming the two scales to be equal) bisecting 
the right angle XOY (cf. Fig. 3). 

Similarly, any equation of the first degree (i.e., not involving
xy, or powers of x or y above the first) may be represented
by a straight line. The generalized equation can
be reduced to the form y = a + bx, where a is a constant
representing the distance from the origin to the point of
intersection of the given line and the y-axis, and b is a constant
representing the slope of the given line (that is, the
tangent of the angle which the line makes with the horizontal).
The constant term a is called the y-intercept. It is
clear from the generalized equation of the straight line that
when x has a value of zero, y will be equal to this constant
term. In the example given above (Fig. 3) a is equal to 0,
and b to 1. The location of a given line depends upon the
signs of a and b as well as upon their magnitudes. The practical
problem involved in the determination of any straight
line is that of finding the values of a and b from the data,
a problem which will appear in various forms in the discussion
of statistical methods.

* The general term “curve” is used to designate any line, straight or curved,
when located with reference to a coordinate system.

These points may be illustrated by the plotting of a
simple equation of the first degree. Thus, to construct
the graph of the function y = 2 + 3x, various values of
x are assumed, and corresponding values of y are
determined. These may be arranged in the form of a table:

      x         y
            (2 + 3x)
     -4       -10
     -2        -4
      0         2
      2         8
      4        14

Plotting these values and connecting the plotted points,
the graph illustrated in Fig. 4 is secured.

Fig. 4. Graph of the Equation y = 2 + 3x

It will be noted that since this function is linear (that is,
the graph takes the form of a straight line) any two of the
points would have been sufficient to locate the line. The
y-intercept is equal to the constant term 2, and the tangent
of the angle which the given line makes with the horizontal
(the slope of the line) is equal to 3, the coefficient of x.
That this curve represents the equation is proved by the fact
that the equation is satisfied by the coordinates of every
point on the curve, and that every pair of values satisfying
the equation is represented by a point on the curve. It is
characteristic of a linear relationship that if one variable be
increased by a constant amount, the corresponding increment
of the other variable will be constant. In the above case as
x grows by constant increments of 2, for example, the
constant increment of the y-variable is 6. Series which
increase in this way by constant increments are termed
arithmetic series.
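The constant-increment property may be checked numerically. The following is a minimal sketch in Python; the function name linear and its default arguments are illustrative choices, not part of the text.

```python
def linear(x, a=2, b=3):
    """Return y = a + b*x (the defaults match the example y = 2 + 3x)."""
    return a + b * x

xs = [-4, -2, 0, 2, 4]                 # x grows by constant increments of 2
ys = [linear(x) for x in xs]

# Difference between each y value and the one before it.
increments = [later - earlier for earlier, later in zip(ys, ys[1:])]

print(ys)          # [-10, -4, 2, 8, 14]
print(increments)  # [6, 6, 6, 6]
```

Each increment equals the slope (3) multiplied by the step in x (2), which is why the y values form an arithmetic series.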

Many examples of linear relationship between variables
are found in the physical sciences. An example from the
economic world is found in the growth of money at simple
interest, that is, interest which is not compounded. If we
let r represent the rate of simple interest, x the number of
years, and y the sum to which one dollar will amount at
the end of x years, the equation of relationship is of the
form

y = 1 + rx.

Since in a given case r will be constant, this is of the simple
linear type. In statistical work precise relationships of
this type rarely if ever occur, but approximations to the
straight line relationship are found constantly.

NON-LINEAR RELATIONSHIP

Non-linear functions are of many types, of which only
a few of the more common will be discussed here. The
student should be familiar with the general characteristics
of the chief non-periodic curves, of which the parabolic
and hyperbolic types, on the one hand, and the exponential
type on the other, are the most important. The potential
series is mentioned as a more general form of rather wide
utility. Of periodic functions the sine curve is briefly
described, as a fundamental form.

Functional relationships of the parabolic or hyperbolic
form are quite common in the physical sciences, and such
curves are found to fit certain classes of economic data.
The general equation, when there is no constant term, is
of the form y = ax^b. The curve is parabolic when the
exponent b is positive, and hyperbolic when b is negative.
The two following examples will serve to illustrate these
types:

Problem: To construct the graph of the function y = x^2.

      x         y
              (x^2)
     -5        25
     -4        16
     -3         9
     -2         4
     -1         1
      0         0
      1         1
      2         4
      3         9
      4        16
      5        25

The graph is shown in Fig. 5. 

Problem: To construct the graph of the function y = x^-1,
for positive values of x.

      x         y
             (x^-1)
     1/3        3
     1/2        2
      1         1
      2        1/2
      3        1/3
      4        1/4
      5        1/5

Fig. 5. Parabola: Graph of the Equation y = x^2

The graph of this function, an equilateral hyperbola, is
shown in Fig. 6. It should be noted that this equation
may also be written y = 1/x or xy = 1.

It is characteristic of relationships of this type that as x
increases in geometric progression, y also changes in geometric
progression. Thus, in the example of the parabola given above
(y = x^2), if we select the x values which form a geometric
series,^ the corresponding y values form a similar series:

x     1     2     4     8     16     32

y     1     4    16    64    256  1,024
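The same property may be verified for both the parabola and the hyperbola. This is a minimal sketch; the helper name ratios is an illustrative choice.

```python
def ratios(values):
    """Ratio of each term to the term before it."""
    return [later / earlier for earlier, later in zip(values, values[1:])]

xs = [1, 2, 4, 8, 16, 32]             # geometric series, multiplier 2
parabola = [x ** 2 for x in xs]       # y = x^2
hyperbola = [x ** -1 for x in xs]     # y = x^-1

print(parabola)            # [1, 4, 16, 64, 256, 1024]
print(ratios(parabola))    # [4.0, 4.0, 4.0, 4.0, 4.0]
print(ratios(hyperbola))   # [0.5, 0.5, 0.5, 0.5, 0.5]
```

When x is multiplied by 2 at each step, y = ax^b is multiplied by 2^b: by 4 for the parabola, and by 1/2 for the hyperbola, whose y values therefore form a decreasing geometric series.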

Another class of functions is of the form represented by
the equation y = ab^x. In equations of this type one of the
variable quantities occurs as an exponent; graphs
representing such equations are called exponential curves. The
example which follows illustrates the type.

^ A geometric series is one each term of which is derived from the preceding
term by the application of a constant multiplier.

Fig. 6. Equilateral Hyperbola: Graph of the Equation y = x^-1
(for positive values of x)

Problem: To construct the graph of the function y = 2^x,
for positive values of x.

      x         y
              (2^x)
      0         1
      1         2
      2         4
      3         8
      4        16
      5        32
      6        64

This graph is shown in Fig. 7.

Fig. 7. Exponential Curve: Graph of the Equation y = 2^x
(for positive values of x)

It has been noted that the relationship between two
variables which increase by constant increments
(constituting arithmetic series) may be represented by a straight
line, and that the relationship between variables increasing
in geometric progression may be represented by either a
parabola or a hyperbola. The exponential curve constitutes
a hybrid type. It describes a relation in which one variable
increases in arithmetic progression while the other increases
in geometric progression. The figures given above illustrate
this relationship.

Curves based upon relationships of the following type have
been employed extensively in statistical inquiries:

y = a + bx + cx^2 + dx^3 + ....

The term potential series has been applied to equations of
this type. Though such curves do not constitute parabolas
of the strict conic section type, a curve based upon such
an equation carried to the second power of x is termed a
second degree parabola, to the third power of x, a third
degree parabola, etc. No uniform and simple type is secured
from this series. It is treated in more detail at a later point.

Periodic functions constitute another distinct type, a class
represented notably by electrical and meteorological
relations, though not confined to these fields. The
characteristic feature of such relationships is that values of the
dependent variable repeat themselves at constant intervals
of the independent variable. The sine curve, the basic type
of this class, is illustrated in the following example.

Problem: To construct the graph of the function y = sin x.

         x               y
 (angle in degrees)   (sin x)
         0°             .000
        30°             .500
        60°             .866
        90°            1.000
       120°             .866
       150°             .500
       180°             .000
       210°           - .500
       240°           - .866
       270°          - 1.000
       300°           - .866
       330°           - .500
       360°             .000
       390°             .500
       etc.
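The tabulated values may be reproduced with a short Python sketch. The function name sine_table and its arguments are illustrative; angles are converted from degrees to radians because the standard library sine function expects radians.

```python
import math

def sine_table(start=0, stop=390, step=30):
    """Return (degrees, sine) pairs, sines rounded to three decimals."""
    table = []
    for deg in range(start, stop + 1, step):
        value = round(math.sin(math.radians(deg)), 3)
        table.append((deg, value + 0.0))   # adding 0.0 turns -0.0 into 0.0
    return table

for deg, value in sine_table():
    print(f"{deg:>4}  {value:>7.3f}")
```

The value at 390° equals that at 30°, since the function repeats itself at intervals of 360°.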


The graph is shown in Fig. 8.

The full importance in statistical work of securing a
mathematical expression for the relation between two
variables cannot be demonstrated until the subject has been
further developed. One fundamental object is the
determination of physical or economic laws underlying observed
phenomena. Another more practical object is the securing
of a formula by means of which values of one variable may
be approximated from given values of the other. Examples
throughout the book will serve to illustrate how these objects
are attained.^


^ A fuller discussion of different curve types is presented below, in the section
dealing with the analysis of time series.

Logarithms

Logarithms, which play such an important part in
general mathematical operations, are of equal importance in
the manipulation of the raw materials of statistics. The
nature of logarithms, and the methods by which they are
employed to facilitate arithmetic processes, may be briefly
reviewed. This discussion is concerned only with the common
system of logarithms of which the base is 10.

Any positive number may be expressed as a power of 10.
Thus

1,000 = 10 × 10 × 10 = 10^3

10,000 = 10 × 10 × 10 × 10 = 10^4.

In each case the exponent of 10 (the small number written
above and to the right) indicates the number of times the
figure 10 is repeated as a factor. For the integral powers
of 10 the exponent is a whole number, but for the other
numbers the exponent will contain a fractional value. Thus
100 is equal to 10 raised to the power 2, or 10^2; 110 is equal
to 10 raised to the power 2.04139, or 10^2.04139.

The exponent of 10, or the index of the power to which 10
must be raised to equal a certain number, is called the
logarithm of that number. The logarithm of 100 is 2, the logarithm
of 110 is 2.04139, the logarithm of 998 is 2.99913. These
figures all have reference to the base 10, though a system
of logarithms might be developed on any base. In general, if

a = b^c

then

log_b a = c

which may be read "the logarithm of a to the base b is
equal to c." The relation between the given number, the
base and the logarithm, when the common system of
logarithms is employed, may be easily remembered if the
following relations are kept in mind:

100 = 10^2
log_10 100 = 2.

The logarithm of any number has two parts, the integral
and the decimal. The whole number is called the
characteristic, and the decimal portion is termed the mantissa.
The former is determined in a given case by inspection,
while the mantissa may be obtained from logarithmic tables.
The characteristic varies with the location of the decimal
point, while the mantissa remains the same for any given
combination of numbers. This fact is illustrated by the
following figures:

log of 8,450 = 3.92686 

log of 845 = 2.92686 

log of 84.5 = 1.92686 

log of 8.45 = .92686 

log of .845 = 9.92686 - 10 

log of .0845 = 8.92686 - 10. 

In finding the natural number to which a given logarithm
corresponds (such natural numbers are termed
anti-logarithms), the mantissa determines the sequence of figures,
while the whole number, or characteristic, determines the
location of the decimal point. For example, in seeking the
anti-logarithm of 2.17609 it is found that the decimal
.17609 follows the natural number 1,500, in a table of
logarithms. Since the characteristic is 2, the natural
number desired must lie between 100 and 1,000, and must,
therefore, be 150.
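The division of a logarithm into characteristic and mantissa may be sketched in Python as follows. The function name split_log is an illustrative choice, and the computation stands in for the printed tables the text refers to.

```python
import math

def split_log(number):
    """Return (characteristic, mantissa) of the common logarithm.

    The mantissa is kept non-negative, as in printed tables, so that
    log .845 is reported as characteristic -1 and mantissa .92686
    (written 9.92686 - 10 in the text).
    """
    log = math.log10(number)
    characteristic = math.floor(log)
    mantissa = round(log - characteristic, 5)
    return characteristic, mantissa

# Moving the decimal point changes the characteristic only.
for n in (8450, 845, 84.5, 8.45, 0.845, 0.0845):
    print(n, split_log(n))

# Anti-logarithm of 2.17609: raise 10 to that power.
print(round(10 ** 2.17609))   # 150
```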

A brief study of the following figures, showing the
progression of numbers corresponding to certain powers of 10,
will help to fix in mind the relations between the multiples
of 10 and their logarithms, and will enable the characteristic
of a desired logarithm to be readily determined.

.0001   .001    .01     .1      1       10      100     1,000   10,000

10^-4   10^-3   10^-2   10^-1   10^0    10^1    10^2    10^3    10^4

The exponents of 10 in the lower row are the logarithms
of the numbers in the upper row.

It should be noted that the logarithms of all numbers
from 0 to 1 are negative. Thus the logarithm of .845 is
-1 + .92686; this is written 9.92686 - 10. In covering
the range of all positive natural numbers from zero to
infinity, logarithms traverse all positive and negative values.
A negative natural number, therefore, can have neither a
positive nor a negative logarithm.

The advantage of thus expressing numbers as powers of
10 lies in the fact that the ordinary arithmetic operations of
multiplication, division, raising to powers and extracting
roots are greatly facilitated by this procedure.

To multiply numbers, add their logarithms. The sum
of the logarithms of the factors is the logarithm of their
product. In general terms:

a^x × a^y = a^(x+y).

Specifically

10^2 × 10^3 = (10 × 10) × (10 × 10 × 10) = 10^5 = 100,000
100 × 1,000 = 100,000.


To divide one number by another, subtract the
logarithm of the latter from the logarithm of the former. The
remainder is the logarithm of the desired quotient.

In general terms:

a^x ÷ a^y = a^(x-y).

Specifically

10^5 ÷ 10^2 = (10 × 10 × 10 × 10 × 10) / (10 × 10) = 10^3 = 1,000
100,000 ÷ 100 = 1,000.


To raise a given number to any power, multiply the
logarithm of the number by the index of the power. The
product is the logarithm of the desired power.

In general terms:

(a^x)^y = a^(xy).

Specifically

(10^3)^2 = (10 × 10 × 10) × (10 × 10 × 10) = 10^6 = 1,000,000
1,000^2 = 1,000,000.

To extract any root of a given number, divide the
logarithm of the number by the index of the root. The
quotient is the logarithm of the desired root.

In general terms:

the y-th root of a^x = a^(x/y).

Specifically

the cube root of 10^6 = 10^(6/3) = 10^2 = 100
the cube root of 1,000,000 = 100.


In summary: 


log (a × b) = log a + log b
log (a ÷ b) = log a - log b
log a^b = b × log a
log (b-th root of a) = log a ÷ b.


These characteristic advantages of logarithms have been 
made use of in the construction of the slide rule, an instru- 
ment for reducing routine toil which should be familiar to 
all students of statistics. 
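The four rules summarized above may be checked numerically. In the sketch below the values of a and b are arbitrary illustrations, and math.log10 takes the place of a table of logarithms.

```python
import math

a, b = 845.0, 3.0   # illustrative values, not taken from the text

# Multiplication: add the logarithms, then take the anti-logarithm.
product = 10 ** (math.log10(a) + math.log10(b))
# Division: subtract the logarithm of the divisor.
quotient = 10 ** (math.log10(a) - math.log10(b))
# Powers: multiply the logarithm by the index of the power.
power = 10 ** (b * math.log10(a))
# Roots: divide the logarithm by the index of the root.
root = 10 ** (math.log10(a) / b)

print(math.isclose(product, a * b))         # True
print(math.isclose(quotient, a / b))        # True
print(math.isclose(power, a ** b))          # True
print(math.isclose(root, a ** (1 / b)))     # True
```

The comparisons use math.isclose rather than strict equality because floating-point logarithms, like the five-place tables they replace, are carried to a limited number of figures.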


LOGARITHMIC EQUATIONS

The graphic representation of data by means of a system
of rectangular coordinates has been described above and
some of the advantages of this method have been outlined.
For many purposes it is desirable to plot logarithms rather 
than the natural numbers themselves. This may result in 
bringing out significant relations more distinctly, or it may 
serve greatly to simplify and facilitate the manipulation of 
data. In particular, when it is possible through the use of 
logarithms to reduce a complex curve to the straight line 
form, a distinct gain has been made in the direction of 
simplicity of treatment and interpretation. 

A linear equation, it will be recalled, is of the general
form y = a + bx, where a and b are constants which
measure, respectively, the y-intercept of the given line and the
slope. The simplification of equations through the use of
logarithms involves in all cases the substitution of log x or
log y, or both, for the x or y variables, thereby reducing an
equation of a higher order to a simpler form.

This process may be illustrated with reference to the
equation y = x^2. When plotted on rectangular coordinates
this equation gives a curve of the parabolic type (cf. Fig. 5).
Reduced to logarithmic form this becomes log y = 2 log x.
This equation, in which the variables are log y and log x,
is linear in form. It is plotted in Fig. 9, for positive values
of log x. To indicate the relations involved, natural numbers
corresponding to the logarithms are given on scales to the
right and at the top of the figure. The natural numbers
appearing on the scales constitute geometric series, while
their logarithms form arithmetic series. Equal vertical
distances on the chart, it will be noted, represent equal
absolute increments on the scale of logarithms and equal
percentage increments on the scale of natural numbers.

Fig. 9. Graph of the Equation log y = 2 log x (Logarithmic form of
the equation y = x^2)

The equation y = 5x^3 can be reduced in the same way
to log y = log 5 + 3 log x, a linear form. Similarly, all
equations of the type y = ax^b, that is to say, all simple
parabolas and hyperbolas, can be reduced to the straight
line form log y = log a + b log x. Graphically this means
plotting the logarithms of the y's against the logarithms of
the x's.

A different problem is presented by an equation of the
type y = ab^x, the graph of which is termed an exponential
curve. Expressed in logarithmic form, we have log y =
log a + x log b. This is also of the linear type, the two
constants being log a and log b, while the variables are x
and log y. If we plot the natural x's and the logs of
the y's, with this type of equation, a straight line will be
secured. A curve of this type is discussed and illustrated
below.
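Both reductions may be sketched numerically. The example below assumes an ordinary least-squares fit as the means of locating the straight line, a choice made here for illustration and not prescribed at this point in the text; the function name fit_line and the constants used to generate the data are likewise illustrative.

```python
import math

def fit_line(us, vs):
    """Least-squares intercept and slope of v = intercept + slope * u."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
             / sum((u - mean_u) ** 2 for u in us))
    return mean_v - slope * mean_u, slope

xs = [1, 2, 3, 4, 5]

# Power curve y = a * x^b: log y plotted against log x is a straight line.
power_ys = [5 * x ** 3 for x in xs]
intercept, slope = fit_line([math.log10(x) for x in xs],
                            [math.log10(y) for y in power_ys])
a_power, b_power = 10 ** intercept, slope       # recovers a = 5, b = 3

# Exponential curve y = a * b^x: log y plotted against natural x.
expo_ys = [3 * 2 ** x for x in xs]
intercept, slope = fit_line(xs, [math.log10(y) for y in expo_ys])
a_expo, b_expo = 10 ** intercept, 10 ** slope   # recovers a = 3, b = 2

print(round(a_power, 6), round(b_power, 6))
print(round(a_expo, 6), round(b_expo, 6))
```

In the first case the slope of the fitted line is the exponent b and the intercept is log a; in the second the slope is log b, so that the anti-logarithm must be taken to recover b itself.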


LOGARITHMIC AND SEMI-LOGARITHMIC CHARTS

There are certain disadvantages to the plotting of
logarithms, however. If a considerable number of points are
being plotted the task of looking up the logarithms may be
tedious, and, in addition, the original values, in which
chief interest lies, will not appear on the chart. These
difficulties may be avoided by constructing charts with the
scales laid off logarithmically, but with the natural numbers
instead of the logarithms appearing on the scales. This is
an arrangement identical with that employed in the
construction of slide rules. Thus, although the natural numbers
are given on the scales, distances are proportional to the
logarithms of the numbers thereon plotted. In Fig. 10
such a chart is presented, showing the graph of the
equation y = x^2.

Fig. 10. Graph of the Equation y = x^2 (Plotted on paper with
logarithmic scales)

A variation of this type of chart which is of great
importance in statistical work is one which is scaled
arithmetically on the horizontal axis and logarithmically on the
vertical axis. This is equivalent, of course, to plotting the
x's on the natural scale and plotting the logarithms of the
y's. As was pointed out above, such a combination of
scales reduces a curve of the exponential type to a straight
line. Plotting paper of this semi-logarithmic or "ratio"
type may be constructed with the aid of a slide rule or
of logarithms, or may be purchased ready made. It is of
particular value in charting economic statistics, because of
the fact that time is usually one of the variables in such
cases, and it is desirable to plot this variable on the natural
scale.

Fig. 11. The Compound Interest Law: Growth of $10.00 at Compound
Interest at 6 per cent for 100 Years (Plotted on arithmetic scale)

As an example of this type of curve the compound
interest law may be used. If r be taken to represent the rate
of interest, x the number of years, p the principal, and y
the sum to which the principal amounts at the end of x
years (interest being compounded annually), an equation is
secured of the form

y = p(1 + r)^x.

Expressed logarithmically this becomes

log y = log p + x log (1 + r),

the equation to a straight line.

Fig. 12. The Compound Interest Law: Growth of $10.00 at Compound
Interest at 6 per cent for 100 Years (Plotted on semi-logarithmic
or ratio scale)

In Fig. 11 a curve representing the growth of ten dollars
at compound interest at 6 per cent is plotted on the natural
scale. This is the graph of the exponential equation

y = 10(1 + .06)^x

y representing the total amount of principal and interest
at the end of x years. Figure 12 shows the same data plotted
on semi-logarithmic paper, the exponential curve being
reduced to a straight line.
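The straightening of the compound interest curve may be verified by computing log y at equal intervals of x. The following is a minimal sketch; the function name amount and the choice of 25-year intervals are illustrative.

```python
import math

def amount(principal, rate, years):
    """Sum to which the principal grows, interest compounded annually."""
    return principal * (1 + rate) ** years

years = [0, 25, 50, 75, 100]
sums = [amount(10.0, 0.06, x) for x in years]
logs = [math.log10(y) for y in sums]

# On the ratio scale, equal steps in x give equal steps in log y.
log_steps = [later - earlier for earlier, later in zip(logs, logs[1:])]

for x, y, log_y in zip(years, sums, logs):
    print(f"{x:>3} years   ${y:>10,.2f}   log y = {log_y:.5f}")
print([round(step, 5) for step in log_steps])
```

The dollar amounts rise by ever larger absolute increments, while each 25-year step adds the same quantity, 25 log (1.06), to the logarithm.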

The use of semi-logarithmic paper is not confined to 
cases in which an exponential curve is straightened out, 
for the significance of many types of data is most effectively 
brought out when charts of this type are used. These
advantages are more fully explained below.

The Construction of Charts

When the results of observations or statistical
investigations have been secured in quantitative form, one of the
first steps toward analysis and interpretation of the data
is that of presenting these results graphically. Not only is
such procedure of scientific value in paving the way for
further investigation of relationships, but it serves an
immediate practical purpose in visualizing the results. A visual
stimulus opens up a far more direct path to our
understanding and imagination than that afforded by the more recently
developed processes of reasoning. The interpretation of a
column of raw figures may be a difficult task; the same data
in graphic form may tell a simple and easily understood story.
For these reasons graphic methods of presentation have come
to play a highly important part in the everyday activities
of business, as well as in the laboratory and drafting room.

It is beyond the scope of this book to present any detailed 
account of the multiplicity of graphs employed by engineers 
and statisticians today. Certain of the more important 
principles of graphic presentation may be briefly explained, 
however, and some of the chief types of graphs which are 
in daily use may be illustrated. Other examples appear in 
later chapters of this book. 

FACTORS GOVERNING THE SELECTION OF A CHART

The selection of the type of chart to be employed in a
given case will depend upon two general considerations.
The first of these relates to the character of the material
to be plotted. While the data of a given problem may
frequently be presented graphically in several different
forms, there is generally one type of chart best adapted
to that material. It may be true, also, that certain types
would be quite inappropriate to the data in question. The
selection of a type of chart to employ, therefore, must
be made with the characteristics of the data clearly in
mind.

Perhaps more important is the purpose which the given
chart is designed to serve. Each of the many types
of charts in common use is appropriate to certain
specific purposes. It will bring out certain characteristics of
the data or will emphasize certain relationships. There
is no chart which is sovereign for all purposes. Until the
purpose is clearly defined the best chart form cannot
be selected. The following descriptions of a few
standard types will facilitate the selection of an appropriate
form.

CHARTS ADAPTED TO THE PLOTTING OF TIME SERIES

In the graphic presentation of a time series, primary
interest attaches to the chronological variations in the values
of the data, to the general trend and to the fluctuations
about the trend. If the purpose is to emphasize the absolute
variations, the differences in absolute units between the
values of the series at different times, a simple chart of the
type illustrated in Fig. 13 will serve the purpose. This
chart depicts annual wheat flour exports from the United
States during the period 1913-1936. Both scales are
arithmetic. Points representing the various annual values are
shown and, to facilitate interpretation, these points are
connected by a series of straight lines. The chart tells a
simple story of year-to-year fluctuations, with a sharp
advance at the end of the World War, a decline as the post-war
emergency passed, several years of moderate growth, and a
severe decline as the world depression deepened in the early
thirties. With respect to general make-up, the following
points should be noted:

Fig. 13. Wheat Flour Exports from the United States

1. The title constitutes a clear description of the material plotted
and indicates the period covered.

2. The vertical scale begins at the zero line, enabling a true
impression to be gained of the magnitude of the
fluctuations.

3. The zero line and the line joining the plotted points are ruled
more heavily than the coordinate lines.

4. Figures for the scales are placed at the left and at the bottom
of the chart. The vertical scale may be repeated at the
right to facilitate reading. All figures are so placed that
they may be read from the base as bottom or from the
right hand edge of the chart as bottom.

5. The y-values of the plotted points are given at the top of the
chart. This practice is helpful, though not necessary, as
the values may be presented in a separate table.

ADVANTAGES OF THE RATIO CHART

If relative rather than absolute variations are of chief
concern, the chart employed should be of the
semi-logarithmic type, scaled logarithmically on the y-axis and
arithmetically on the x-axis. In such a chart equal percentage
variations are represented by equal vertical distances, as
opposed to the ordinary arithmetic type in which equal
absolute variations are represented by equal vertical
distances. The argument for the use of the semi-logarithmic
or ratio chart for the representation of time series is that,
in general, the significance of a given change depends upon
the magnitude of the base from which the change is
measured. That is, an increase of 100 on a base of 100 is as
significant as an increase of 10,000 on a base of 10,000. In
each case there is an increase of 100 per cent. The absolute
increase in the second case is 100 times that in the first case,
and the two changes would show in this proportion on the
arithmetic chart. They would show as of equal importance
on the semi-logarithmic chart.
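The point may be verified with a few lines of Python. The function name vertical_distance is an illustrative choice; it measures the distance between two values on a logarithmic scale in units of the common logarithm.

```python
import math

def vertical_distance(start, end):
    """Distance between two values on a logarithmic (ratio) scale."""
    return math.log10(end) - math.log10(start)

small = vertical_distance(100, 200)          # increase of 100 on a base of 100
large = vertical_distance(10_000, 20_000)    # increase of 10,000 on 10,000

print(round(small, 5), round(large, 5))      # 0.30103 0.30103
print((20_000 - 10_000) / (200 - 100))       # 100.0
```

The two changes occupy the same vertical distance, log 2, on the ratio chart, although on the arithmetic chart the second would be drawn 100 times as large as the first.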

Such a chart is presented in Fig. 14, which shows the 
course of steel production in the United States from 1896 
to 1936. The absolute magnitudes are plotted, but the 
vertical scale is so constructed as to represent variations 
from year to year in proportion to their relative magnitude.

Fig. 15. Exports of the United States, 1920-1936, Showing Total
Exports and Exports to Selected Areas (Monthly averages for the years
named are plotted on an arithmetic scale)

Certain distinctive advantages of the ratio or logarithmic
ruling are brought out by a comparison of Fig. 15 and
Fig. 16. The data presented graphically in these two charts
are shown in Table 2:

Table 2

Exports of the United States, 1920-1936
(Monthly averages, in thousands of dollars)

Year   To         To         To         To United   Total to   Grand
       France     Germany    Italy      Kingdom     Europe     total

1920   $56,349    $25,953    $30,980    $161,319    $372,174   $685,668
1921    18,745     31,027     17,955      78,510     196,992    373,753
1922    22,247     26,343     12,575      71,319     173,613    319,315
1923    22,678     26,403     13,961      73,527     174,451    347,291
1924    23,472     36,702     15,695      81,912     203,775    382,582
1925    23,358     39,195     17,096      86,155     216,979    409,154
1926    22,000     30,347     13,117      81,051     192,512    400,722
1927    19,066     40,140     10,971      70,005     192,576    405,448
1928    20,058     38,938     13,510      70,613     197,912    427,363
1929    22,133     34,204     12,831      70,667     195,070    435,083
1930    18,663     23,189      8,369      56,509     153,198    320,265
1931    10,152     13,838      4,668      37,923      95,040    202,024
1932     9,297     11,139      4,095      24,027      65,358    134,251
1933    10,143     11,669      5,103      26,978      70,816    139,583
1934     9,642      9,062      5,381      31,896      79,150    177,733
1935     9,751      7,665      6,036      36,117      86,770    190,240
1936    10,795      8,382      4,900      36,662      86,694    204,457

(Data compiled by Bureau of Foreign and Domestic Commerce, U. S.
Department of Commerce.)


If the six series are to be presented on a single chart, 
scaled arithmetically, a scale must be selected which will 
include the largest item recorded, $685,668,000. Such a 
scale reduces the relative importance of the smaller magni- 
tudes. From Fig. 15 it appears that during the period cov- 
ered by the chart very large fluctuations occurred in total 
exports, substantial but somewhat smaller movements oc- 
curred in exports to Europe, and that exports to the four 
individual countries suffered much less severe fluctuations. 
Such a picture is quite misleading. The true state of affairs 
is reflected in Fig. 16, in which the same data are plotted 
on paper with a semi-logarithmic ruling. Fluctuations in
exports to the individual countries are here seen to have
been relatively greater than the movements of total exports.
For the purpose of comparing series which differ materially
with respect to the magnitude of the individual items, the
arithmetic ruling is quite useless, giving a thoroughly
distorted picture of the true relations. The ratio ruling permits
a legitimate comparison.

Fig. 16. Exports of the United States, 1920-1936, Showing Total
Exports and Exports to Selected Areas, with Scales of Increase,
Decrease, and Comparison (Monthly averages for the years named are
plotted on a semi-logarithmic scale)
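The contrast between absolute and relative fluctuations may be checked against the figures of Table 2. The sketch below uses the ratio of the highest to the lowest value of each series as a rough measure of relative fluctuation; this measure, and the names exports and swing, are illustrative choices and not the author's method.

```python
# Monthly averages in thousands of dollars, 1920-1936, as given in Table 2.
exports = {
    "France": [56349, 18745, 22247, 22678, 23472, 23358, 22000, 19066, 20058,
               22133, 18663, 10152, 9297, 10143, 9642, 9751, 10795],
    "Germany": [25953, 31027, 26343, 26403, 36702, 39195, 30347, 40140, 38938,
                34204, 23189, 13838, 11139, 11669, 9062, 7665, 8382],
    "Italy": [30980, 17955, 12575, 13961, 15695, 17096, 13117, 10971, 13510,
              12831, 8369, 4668, 4095, 5103, 5381, 6036, 4900],
    "United Kingdom": [161319, 78510, 71319, 73527, 81912, 86155, 81051,
                       70005, 70613, 70667, 56509, 37923, 24027, 26978,
                       31896, 36117, 36662],
    "Europe": [372174, 196992, 173613, 174451, 203775, 216979, 192512,
               192576, 197912, 195070, 153198, 95040, 65358, 70816, 79150,
               86770, 86694],
    "Grand total": [685668, 373753, 319315, 347291, 382582, 409154, 400722,
                    405448, 427363, 435083, 320265, 202024, 134251, 139583,
                    177733, 190240, 204457],
}

def swing(series):
    """Ratio of the largest to the smallest value: a relative range."""
    return max(series) / min(series)

swings = {name: swing(series) for name, series in exports.items()}
for name, value in sorted(swings.items(), key=lambda item: item[1]):
    absolute = max(exports[name]) - min(exports[name])
    print(f"{name:<15} absolute range {absolute:>7,}   ratio {value:.2f}")
```

By this measure the grand total, which shows much the largest absolute range, shows the smallest relative range of the six series, in agreement with the impression given by Fig. 16.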

The scales printed below Fig. 16 emphasize certain very
useful features of the logarithmic ruling. The scale of
increase may be used to measure with a fair degree of
accuracy the increase in a given series between any two dates.
A given vertical distance on the chart, it will be recalled,
represents a constant percentage increase at all points on
the chart. Thus the distance from 1 to 10, along the vertical
scale, is the same as the distance from 100 to 1,000. Any
vertical distance may be measured, and the percentage of
increase which it represents may be determined by laying
off the given distance along the scale of increase, which is
always read from the bottom up. For example, to determine
the degree of increase in total exports from 1932 to 1935,
we measure the vertical distance between the points plotted
for these two years. Laying off this distance along the scale,
it is found to represent about a 40 per cent increase.

The scale of decrease is used in a similar fashion. The
vertical distance between any two points is measured, and
the percentage decrease which it represents is determined
by laying off the given distance on the scale from the top
downward. The arrows indicate the direction in which the
various scales are to be read.

By means of the scale of comparison the percentage relation 
of one series to another at any time may be determined. 
For example, we may wish to know the percentage relation 
between exports to Europe and total exports in 1935. The 
vertical distance between the two plotted points is measured, 
and laid off on the scale of comparison, reading from the top 
downward. It is found to be approximately 45 per cent.
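The two chart readings may be compared with the figures of Table 2. This is a minimal sketch; the function names are illustrative, and the values are the grand totals for 1932 and 1935 and the total to Europe for 1935.

```python
def percent_increase(earlier, later):
    """Percentage by which `later` exceeds `earlier`."""
    return (later / earlier - 1) * 100

def percent_of(part, whole):
    """`part` expressed as a percentage of `whole`."""
    return part / whole * 100

total_1932, total_1935, europe_1935 = 134_251, 190_240, 86_770

print(round(percent_increase(total_1932, total_1935), 1))   # 41.7
print(round(percent_of(europe_1935, total_1935), 1))        # 45.6
```

The computed values lie close to the approximate readings of 40 and 45 per cent taken from the scales, which is the fair degree of accuracy claimed for the method.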

Scales of the type illustrated above may be readily
constructed on a given chart by using the ratio ruling for the
scale intervals. When a series of charts is prepared on
semi-logarithmic paper of a standard type it is convenient
to construct such scales in a more permanent form, in the
shape of special rulers.

A ratio chart is particularly useful when interest attaches 
to rates of growth (or decline) over a considerable period of 
time. In such a case, the reading of the chart is facilitated 
by the plotting of straight diagonal lines indicating uniform 
rates of change. These should radiate from a single point of 
origin. The procedure is illustrated in Fig. 17. The diagonal 
lines there shown indicate changes at uniform rates ranging 
from 10 per cent to 50 per cent per year. By reference to 
these lines the user of the chart may readily determine the 
approximate rate of growth of the plotted series between 
any two years. 

Fig. 17. Production of Rayon Filament Yarn in the United States, 
1912-1937, With Lines Defining Uniform Rates of Growth 

The chief advantages of the semi-logarithmic ruling in 
chart construction may be briefly summarized: 

1. A curve of the exponential type becomes a straight line when 
   plotted on a semi-logarithmic chart. For example, a curve 
   representing the growth of any sum of money at compound 
   interest takes the form of a straight line when so plotted. 

2. In any series, so long as the rate of increase or decrease remains 
   constant the graph will be a straight line on this ruling. 

3. Equal relative changes are represented by lines having equal 
   slopes. Thus two series increasing or decreasing at equal 
   rates will be represented by parallel lines. 

4. Comparison of the rates of change in two or more series is 
   effected by comparison of the slopes of the plotted lines. 

5. The semi-logarithmic ruling permits, at the same time, the 
   plotting of absolute magnitudes and the comparison of 
   relative changes. 

6. Comparison of series differing materially in the magnitude of 
   individual items is possible with the semi-logarithmic chart. 

7. Percentages of change may be read and percentage relations 
   between magnitudes determined directly from the chart. 
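The first three points of this summary may be checked numerically. The Python sketch below takes two hypothetical sums of money, differing greatly in size but growing at the same assumed rate of 6 per cent compounded annually, and examines the year-to-year steps of their logarithms. Constant steps within a series correspond to a straight line on the ratio chart; equal steps in the two series correspond to parallel lines. The function names and the figures are assumptions made for illustration.

```python
import math

def compound_growth(principal, rate, years):
    """Value of a sum at compound interest at the end of each year."""
    return [principal * (1 + rate) ** t for t in range(years + 1)]

def log_steps(series):
    """Year-to-year vertical distances on a logarithmic scale."""
    logs = [math.log10(v) for v in series]
    return [b - a for a, b in zip(logs, logs[1:])]

small = compound_growth(100, 0.06, 10)      # hypothetical sum at 6 per cent
large = compound_growth(50_000, 0.06, 10)   # much larger sum, same rate

steps_small = log_steps(small)
steps_large = log_steps(large)

# Constant steps mean a straight line; equal steps mean parallel lines.
straight = all(math.isclose(s, steps_small[0]) for s in steps_small)
parallel = math.isclose(steps_small[0], steps_large[0])
print(straight, parallel, round(steps_small[0], 4))
```

On a natural scale the larger series would dwarf the smaller; on the logarithmic scale the two differ only in height.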


CHARTS FOR THE COMPARISON OF FREQUENCIES 

A different type of chart is called for when the object is 
the comparison of frequencies, that is, numbers of events 
or things of different classes. The following census figures 
may serve to illustrate the problem. 


Table 3 

Farms in New England States in 1935 

State               Number of farms 

Maine                   41,907 
New Hampshire           17,696 
Vermont                 27,061 
Massachusetts           36,094 
Rhode Island             4,327 
Connecticut             32,167 


A graphic comparison of these six states with respect to num- 
ber of farms in 1935 is afforded by the bar diagram in Fig. 18. 
This is a simple but effective type of chart for this purpose. 

Fig. 18. Farms in New England States in 1935 

Further examples of this type of chart, as employed in 
the representation of frequency distributions, are contained 
in the next chapter. It is there shown how a frequency 
polygon or frequency curve may grow out of the simple 
bar diagram, when data of certain kinds are being handled. 
Such frequency curves constitute very important graphic 
types, but it will be more appropriate to treat them in 
full at a later point. 

CHARTS FOR THE REPRESENTATION OF COMPONENT PARTS 

It is frequently desirable in tabular and graphic presenta- 
tion to break up a total into its component parts, in order 
that changes in the parts as well as in the total may be fol- 
lowed. The table on page 43 exemplifies this procedure. 

These figures are presented graphically in Fig. 19, which 
reveals the varying post-war fortunes of different interests 
in American manufacturing industries. It is clear from the 
diagram that the general swings of material costs, labor costs 
and overhead costs in American manufacturing industries 
have paralleled the fluctuations in total value of products. 
Some of the movements of the component items are of ex- 
ceptional interest, however. Overhead costs (with which 
profits are here combined) showed a notable expansion be- 
tween 1921 and 1929. The great recession that followed 
squeezed all the elements of the total, forcing them to levels 
well below those of the 1921 depression. 

Table 4 

Total Value of Products and Elements of Production Costs, 
Manufacturing Industries of the United States, 1919-1935 
(Millions of dollars) 

Year    Cost of        Labor cost    Overhead cost     Total value 
        materials 1    (wages)       plus profits 2    of products 

1919     $37,233        $10,462         $14,347          $62,042 
1921      25,321          8,202          10,130           43,653 
1923      34,706         11,009          14,841           60,556 
1925      35,936         10,730          16,048           62,714 
1927      34,803         10,836          16,639           62,278 
1929      38,178         11,607          20,176           69,961 
1931      21,681          7,173          12,184           41,038 
1933      16,821          5,262           9,276           31,359 
1935      26,264          7,545          11,951           45,760 

1 Including containers, fuel and purchased electric energy. 

2 This item represents the difference between total direct costs (materials 
and wages) and total value of products. It includes overhead costs proper, 
plus salaries, taxes, profits, etc. 

Fig. 19. Total Value of Products and Elements of Production Costs, 
Manufacturing Industries of the United States, 1919-1935 
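Since the overhead item is defined as the difference between total value and direct costs, the three components of each year must sum to the total, and each may be expressed as a share of it. The following Python sketch, a minimal illustration with a hypothetical helper name, recomputes the totals and the percentage shares from the component figures as read from Table 4. The total for 1935 is taken here as the sum of its three components.

```python
# (year, cost of materials, wages, overhead plus profits), millions of dollars,
# as read from Table 4.
rows = [
    (1919, 37_233, 10_462, 14_347),
    (1921, 25_321, 8_202, 10_130),
    (1923, 34_706, 11_009, 14_841),
    (1925, 35_936, 10_730, 16_048),
    (1927, 34_803, 10_836, 16_639),
    (1929, 38_178, 11_607, 20_176),
    (1931, 21_681, 7_173, 12_184),
    (1933, 16_821, 5_262, 9_276),
    (1935, 26_264, 7_545, 11_951),
]

def shares(materials, wages, overhead):
    """Return the total and each component as a percentage of the total."""
    total = materials + wages + overhead
    parts = (materials, wages, overhead)
    return total, tuple(round(100 * part / total, 1) for part in parts)

table = {year: shares(m, w, o) for year, m, w, o in rows}
for year, (total, pct) in table.items():
    print(year, total, pct)
```

The shares show what the absolute figures suggest: overhead plus profits formed a larger fraction of the total in 1929 than in 1921.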


CUMULATIVE CHARTS 

In many cases chief interest in the development of a 
series attaches not to the value of each successive item but 
to the cumulated total of a number of such items. This 
may be so when a yearly production program has been 
laid out. In such a case it is the relation between cumu- 
lated production to date and scheduled production to date 
which is of major interest, and a chart form is needed 
which will enable this comparison to be made. The fol- 
lowing figures illustrate the type of data for which such 
charts are appropriate. 


Table 5 

Cumulative Production Schedule and Cumulative Output, 1936 
(Speedwell Automobile Company) 

             Production    Cumulative                  Cumulative 
             schedule      production      Output      output 
Month        (cars)        schedule        (cars)      (cars) 
                           (cars) 

January        8,000          8,000         6,125        6,125 
February      10,000         18,000         9,250       15,375 
March         12,000         30,000        10,514       25,889 
April         15,000         45,000        15,131       41,020 
May           14,000         59,000        12,159       53,179 
June          12,000         71,000        13,250       66,429 
July          11,000         82,000        11,462       77,891 
August        10,000         92,000        10,531       88,422 
September      6,000         98,000         4,621       93,043 
October        9,000        107,000         9,843      102,886 
November      10,000        117,000        13,785      116,671 
December      10,000        127,000 


It is assumed that this table represents the situation as 
of the end of November. 

In Fig. 20 the two cumulative curves are plotted. The 
relation between actual and scheduled production at the 
end of each month is shown on the chart, and it is possible 
from the scale to read the approximate amount by which 
production is behind schedule. By reference to the figures, 
which should always accompany the chart, the exact rela- 
tion may be determined. Such a chart has many ap- 
plications, some of which are illustrated in the following 
chapter. 

Fig. 20. Comparison of Scheduled and Actual Output (Cumulative), 
Speedwell Automobile Co., 1936 
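The two cumulative series are obtained by running totals of the monthly figures. The Python sketch below illustrates the arithmetic, using the monthly schedule and output for January through November as read from the Speedwell table; the variable names are hypothetical. The difference between the two running totals is the amount by which production stands behind schedule at the end of each month.

```python
from itertools import accumulate

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov"]
# Monthly figures (cars), January through November, as read from the table.
schedule = [8_000, 10_000, 12_000, 15_000, 14_000, 12_000,
            11_000, 10_000, 6_000, 9_000, 10_000]
output = [6_125, 9_250, 10_514, 15_131, 12_159, 13_250,
          11_462, 10_531, 4_621, 9_843, 13_785]

cum_schedule = list(accumulate(schedule))
cum_output = list(accumulate(output))

# A positive figure means production is behind schedule.
shortfall = [s - o for s, o in zip(cum_schedule, cum_output)]

for m, s, o, gap in zip(months, cum_schedule, cum_output, shortfall):
    print(f"{m}  {s:>7,}  {o:>7,}  {gap:>6,}")
```

The final line gives the exact relation which the chart shows only approximately.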
THE GANTT PROGRESS CHART 

The same data may be presented in a very effective form 
by making use of a type of chart developed by Mr. H. L. 
Gantt. An adequate description of this chart and of its 
many uses would far exceed the space which can be given 
to it here, but its characteristics may be indicated in a very 
brief account. 

Once a schedule has been drawn up, the Gantt chart 
may be utilized in checking actual accomplishment against 
the schedule. Having such a schedule as that given in 
Table 5, the monthly and annual quotas may be entered 
on a form similar to that shown in Fig. 21. The entry to 
the left of each monthly space indicates the amount sched- 
uled for production during that month. The entry to the 
right of each monthly space indicates the cumulated sched- 
uled production to the end of the given month. In this 
figure the results of the first two months’ operations are 
shown. The heavy black line indicates the cumulated 
actual production during this period, amounting to 15,375 
cars. The narrow upper lines in the January and February 
columns measure the actual production in each of those 
months. If actual production in either month had equaled 
the scheduled production the light line would extend across 
the full monthly space. When actual production in a given 
month exceeds the scheduled production a double light line 
appears. 

It should be noted that the spaces into which each monthly 
period is divided represent equal time intervals but varying 
amounts in terms of actual production. Thus the space 
representing one fifth of the January interval represents a 
production of 1,600 cars (the January quota being 8,000). 
The space representing one fifth of the April interval repre- 
sents 3,000 cars (the April quota being 15,000). In reading 
the chart in terms of absolute magnitudes reference must be 
had to the monthly quotas. 
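The lengths of the lines on such a chart may be worked out from the quotas. The Python sketch below is an illustration only, with hypothetical function names. It assumes that the light line covers the fraction of the monthly space equal to actual output divided by the quota, and that the heavy line fills each monthly space in turn in proportion to that month's quota, which is one reading of the description given above.

```python
def light_line(actual, quota):
    """Fraction of a monthly space covered by that month's output."""
    return actual / quota

def heavy_line(monthly_actual, quotas):
    """Length of the cumulative line, measured in monthly spaces.

    Each space is filled in proportion to its own quota, so that one
    space stands for a different number of cars in different months.
    """
    remaining = sum(monthly_actual)
    length = 0.0
    for quota in quotas:
        if remaining >= quota:
            length += 1.0
            remaining -= quota
        else:
            length += remaining / quota
            break
    return length

quotas = [8_000, 10_000, 12_000, 15_000]   # January to April quotas
actual = [6_125, 9_250]                    # January and February output

print(quotas[0] / 5, quotas[3] / 5)        # cars per fifth of a space
print([round(light_line(a, q), 3) for a, q in zip(actual, quotas)])
print(round(heavy_line(actual, quotas), 4))
```

On this reading the heavy line at the end of February would cover the whole of the January space and somewhat less than three quarters of the February space.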



[Fig. 21. Progress chart as of the end of February] 

Fig. 22. Comparison of Scheduled and Actual Output, 1936; Gantt Progress Chart 
(Showing the situation on November 30th) 
The situation at the end of November is shown in Fig. 22. 
The arrow at the top of the diagram indicates the point of 
time actually reached. That actual production is slightly 
behind scheduled production is apparent from the relation 
between this arrow and the heavy black line, while the light 
lines indicating monthly production show that actual output 
has exceeded the monthly quota in five of the last six 
months. 

The Gantt chart has a great variety of applications in 
governmental and business organizations. The economy 
of space is such that developments in a number of depart- 
ments or districts may be shown on a single chart. It 
constitutes the simplest and most effective graphic method 
known for following the progress of work under way, for 
comparing actual accomplishment with an established pro- 
gram. And in so doing, it increases by so much the efficiency 
of administrative control. 

Preferred Practice for Graphic Presentation 

Graphic methods have been widely employed in the physi- 
cal and social sciences and in business, and the resulting 
diversity of uses has made it difficult to secure standardiza- 
tion of practice. To remedy this defect a committee repre- 
senting engineering, statistical and research organizations 
was organized in 1929, under the sponsorship of the American 
Society of Mechanical Engineers, for the purpose of formulat- 
ing principles of preferred practice in this field. This group, 
acting as a Sectional Committee of the American Standards 
Association, is compiling a code of preferred practice for 
graphic presentation. The first section of this code, dealing 
with time series charts, has been issued by the sectional com- 
mittee. This report furnishes an excellent summary of con- 
ventional procedures, with detailed recommendations con- 
cerning principles appropriate to the graphic presentation of 
time series. Somewhat more specialized, although it deals 
with certain principles applicable to the entire field of 
graphics, is another report of the same committee on charts 
suitable for use as lantern slides. 
CHAPTER III 


THE ORGANIZATION OF STATISTICAL DATA: 

THE FREQUENCY DISTRIBUTION 

The task of the statistician engaged in business or eco- 
nomic research includes the organization, analysis and in- 
terpretation of quantitative data relating to business affairs 
and to economic conditions. To these fundamental opera- 
tions that of collecting the original data may be added, 
though more frequently data will be compiled directly from 
primary or secondary sources. 

At the outset it is necessary to distinguish between the 
problems arising in the analysis of time series and those 
involved in the organization and analysis of materials in 
connection with which the time factor does not enter. In 
studying a time series the primary object is to measure and 
analyze the chronological variations in the value of the 
variable. Thus one may study variations in sales over a 
period of years, fluctuations in the production of bituminous 
coal, or changes in the general level of prices. Quite differ- 
ent is the procedure in the study of such a problem as 
income distribution at a given time. In this case we are 
desirous of knowing how many people in the United States 
fall in each of a number of income classes. The general 
problem of organization in this latter class of cases is to 
determine how many times each value of a variable is re- 
peated and how these values are distributed. Data of this 
sort, when organized, constitute a frequency series, as opposed 
to the time or historical series. The methods appropriate 
to these two types of analysis differ fundamentally and will 
therefore be treated separately. In the present section we 
are concerned with the organization and preliminary analy- 
sis of data in connection with which the time element, 
while it may be present, does not enter as a factor. 

Unorganized Data 

When quantitative data of the type with which the 
statistician works are presented in a raw state they appear 
as unorganized masses of material, without form or struc- 
ture. They may have been drawn from the production or 
sales records of a business establishment, or they may 
represent a miscellaneous collection of price quotations. If 
the data have been gathered by other agencies they may 
already have been arranged in the form of a general table, 
but this form may be entirely unsuited to the particular 
object in the mind of the investigator. The first task of 
the statistician is the organization of the figures in such 
a form that their significance, for the purpose in hand, may 
be appreciated, that comparison with masses of similar 
data may be facilitated, and that further analysis may be 
possible. Scientific method, it has been noted, involves 
observation, inference, and verification. Data, the results of 
observation, must be put into definite form and given 
coherent structure before the process of inference is possi- 
ble. 

The figures on page 52, representing the earnings during 
a given week of 210 individuals engaged in piece work in a 
certain manufacturing establishment, will serve as an ex- 
ample of such data in their raw state. 

Weekly Earnings of 210 Employees 

$26.25  28.70  24.15  29.75  29.20  30.60  23.40  24.75  26.70  24.35 
 25.75  27.20  28.30  25.25  27.75  27.60  28.20  27.30  27.80  26.35 
 27.40  28.30  26.60  25.75  27.70  28.60  25.30  27.80  26.40  27.30 
 28.35  27.00  24.30  27.80  27.60  26.30  27.40  23.50  29.60  27.80 
 27.60  25.35  27.55  29.00  24.10  27.00  24.50  27.25  26.15  29.30 
 23.10  27.10  28.50  27.45  26.15  28.35  27.95  25.55  27.55  26.60 
 24.25  30.00  28.55  28.00  27.30  27.90  25.25  24.10  27.45  24.55 
 26.55  27.55  26.75  31.00  24.00  25.35  26.50  28.30  27.95  25.55 
 30.25  28.55  26.75  24.60  25.75  26.55  27.80  28.90  29.55  30.00 
 24.60  25.75  26.30  27.00  28.25  25.25  25.75  26.25  26.30  26.75 
 27.90  28.30  25.70  26.30  26.60  27.00  30.75  28.60  28.10  23.50 
 24.75  25.15  26.30  27.25  28.15  29.10  30.10  29.90  28.55  27.30 
 26.55  27.55  23.00  24.50  22.85  26.55  27.55  28.10  30.70  28.60 
 27.90  26.80  24.10  25.25  26.30  27.90  26.90  25.30  25.80  28.85 
 27.55  27.30  25.00  26.00  26.55  27.80  28.60  30.55  29.50  24.10 
 25.15  27.15  28.10  26.30  27.10  24.60  27.80  26.30  27.90  29.80 
 24.10  25.15  27.50  24.25  25.70  26.80  30.15  29.30  28.15  28.65 
 24.55  25.85  26.10  27.00  26.80  27.55  29.00  23.00  28.60  29.30 
 28.55  28.80  27.55  23.60  26.10  27.15  25.75  26.80  27.15  26.30 
 28.55  25.80  24.55  25.80  26.75  27.30  27.55  28.25  25.60  26.30 
 26.85  28.60  27.30  26.00  28.10  32.00  28.15  26.30  27.75  26.25 

The Array 

If these figures are arranged in order of magnitude some- 
thing will have been done toward securing a coherent 
structure. The range covered and the general distribution 
throughout this range will then be clear, and the way will 
be prepared for further organization. When so arranged 
the array on page 53 is secured. 

Array: Weekly Earnings of 210 Employees 
(Each column is read from the top down) 

$22.85  $25.15  $26.15  $26.75  $27.45  $27.95  $28.60 
 23.00   25.15   26.15   26.75   27.45   27.95   28.65 
 23.00   25.15   26.25   26.80   27.50   28.00   28.70 
 23.10   25.25   26.25   26.80   27.55   28.10   28.80 
 23.40   25.25   26.25   26.80   27.55   28.10   28.85 
 23.50   25.25   26.30   26.80   27.55   28.10   28.90 
 23.50   25.25   26.30   26.85   27.55   28.10   29.00 
 23.60   25.30   26.30   26.90   27.55   28.15   29.00 
 24.00   25.30   26.30   27.00   27.55   28.15   29.10 
 24.10   25.35   26.30   27.00   27.55   28.15   29.20 
 24.10   25.35   26.30   27.00   27.55   28.20   29.30 
 24.10   25.55   26.30   27.00   27.55   28.25   29.30 
 24.10   25.55   26.30   27.00   27.60   28.25   29.30 
 24.10   25.60   26.30   27.10   27.60   28.30   29.50 
 24.15   25.70   26.30   27.10   27.60   28.30   29.55 
 24.25   25.70   26.30   27.15   27.70   28.30   29.60 
 24.25   25.75   26.35   27.15   27.75   28.30   29.75 
 24.30   25.75   26.40   27.15   27.75   28.35   29.80 
 24.35   25.75   26.50   27.20   27.80   28.35   29.90 
 24.50   25.75   26.55   27.25   27.80   28.50   30.00 
 24.50   25.75   26.55   27.25   27.80   28.55   30.00 
 24.55   25.75   26.55   27.30   27.80   28.55   30.10 
 24.55   25.80   26.55   27.30   27.80   28.55   30.15 
 24.55   25.80   26.55   27.30   27.80   28.55   30.25 
 24.60   25.80   26.60   27.30   27.80   28.55   30.55 
 24.60   25.85   26.60   27.30   27.90   28.60   30.60 
 24.60   26.00   26.60   27.30   27.90   28.60   30.70 
 24.75   26.00   26.70   27.30   27.90   28.60   30.75 
 24.75   26.10   26.75   27.40   27.90   28.60   31.00 
 25.00   26.10   26.75   27.40   27.90   28.60   32.00 

Frequency Tables 

While the array presents the figures in a shape much 
more suitable for study than the haphazard distribution 
first shown, there is still something to be desired before the 
mind can readily grasp the full significance of the data. 
The factory manager may see that the smallest amount 
earned during the week was $22.85, that the largest amount 
earned was $32.00, and that most of the employees earned 
between $25.00 and $29.00, but this is still a vague descrip- 
tion of the data. By a process of grouping, that is, by 
putting into common classes all individuals whose earnings 
fall within certain limits, a simplified and more compact 
presentation of the wage distribution may be obtained. The 
following table shows the results of this grouping process 
when the range of each class (the class-interval) is two 
dollars. 

Table 6 

Frequency Distribution of Employees 
(Classified on the basis of weekly earnings [class-interval = $2]) 

Weekly earnings       Number earning (frequency) 

$22.00 to $23.99                 8 
 24.00 to  25.99                48 
 26.00 to  27.99                96 
 28.00 to  29.99                47 
 30.00 to  31.99                10 
 32.00 to  33.99                 1 

                               210 

This table presents a condensed summary of the original 
figures, a summary which not only gives us the approximate 
range of the earnings, but shows, also, how the earnings of 
the 210 workers are distributed throughout this range. 
There has been a considerable loss of detail, it will be 
noted. 


From this table we may learn that there are 48 persons who 
earned during the given week between $24.00 and $25.99, 
but we cannot learn how the earnings of the 48 individuals 
were distributed throughout this range of two dollars. All 
may have earned exactly $24.00, so far as we may know 
from the figures shown in the table. This loss of detail is 
an inevitable accompaniment of the condensation and sim- 
plification which the process of classification involves. 

If the size of the class-interval be decreased the loss of 
detail is less pronounced, though the increase in the number 
of classes means a more cumbersome table and one which 
presents a more complex picture to the eye. The tables 
on page 56 present the same data, classified with intervals 
of one dollar, fifty cents, and twenty-five cents. 

Frequency Distributions of Employees 
(Classified on the basis of weekly earnings) 

Table 7 (Class-interval = $1) 

Weekly earnings       Frequency 

$22.00 to $22.99           1 
 23.00 to  23.99           7 
 24.00 to  24.99          21 
 25.00 to  25.99          27 
 26.00 to  26.99          42 
 27.00 to  27.99          54 
 28.00 to  28.99          34 
 29.00 to  29.99          13 
 30.00 to  30.99           9 
 31.00 to  31.99           1 
 32.00 to  32.99           1 

                         210 

Table 8 (Class-interval = 50 cents) 

Weekly earnings       Frequency 

$22.50 to $22.99           1 
 23.00 to  23.49           4 
 23.50 to  23.99           3 
 24.00 to  24.49          11 
 24.50 to  24.99          10 
 25.00 to  25.49          12 
 25.50 to  25.99          15 
 26.00 to  26.49          22 
 26.50 to  26.99          20 
 27.00 to  27.49          24 
 27.50 to  27.99          30 
 28.00 to  28.49          17 
 28.50 to  28.99          17 
 29.00 to  29.49           7 
 29.50 to  29.99           6 
 30.00 to  30.49           5 
 30.50 to  30.99           4 
 31.00 to  31.49           1 
 31.50 to  31.99           0 
 32.00 to  32.49           1 

                         210 

Table 9 (Class-interval = 25 cents) 

Weekly earnings       Frequency 

$22.75 to $22.99           1 
 23.00 to  23.24           3 
 23.25 to  23.49           1 
 23.50 to  23.74           3 
 23.75 to  23.99           0 
 24.00 to  24.24           7 
 24.25 to  24.49           4 
 24.50 to  24.74           8 
 24.75 to  24.99           2 
 25.00 to  25.24           4 
 25.25 to  25.49           8 
 25.50 to  25.74           5 
 25.75 to  25.99          10 
 26.00 to  26.24           6 
 26.25 to  26.49          16 
 26.50 to  26.74          10 
 26.75 to  26.99          10 
 27.00 to  27.24          11 
 27.25 to  27.49          13 
 27.50 to  27.74          14 
 27.75 to  27.99          16 
 28.00 to  28.24           9 
 28.25 to  28.49           8 
 28.50 to  28.74          14 
 28.75 to  28.99           3 
 29.00 to  29.24           4 
 29.25 to  29.49           3 
 29.50 to  29.74           3 
 29.75 to  29.99           3 
 30.00 to  30.24           4 
 30.25 to  30.49           1 
 30.50 to  30.74           3 
 30.75 to  30.99           1 
 31.00 to  31.24           1 
 31.25 to  31.49           0 
 31.50 to  31.74           0 
 31.75 to  31.99           0 
 32.00 to  32.24           1 

                         210 

The four tables we have thus constructed represent four 
different degrees of condensation of the same data. Tables 6, 
7, and 8 present the same general characteristics: a small 
number of cases in the extreme classes and a more or less 
regular increase in the frequencies as the center of each of 
the distributions is approached. The departure from reg- 
ularity becomes greater the greater the number of classes. 
Table 9, in which the class-interval is 25 cents, has 38 classes. 
In this table the distribution of cases throughout the range 
is highly irregular, with pronounced departures from sym- 
metry. The structure of each of the other tables is orderly 
and approaches more closely a condition of symmetry. Each 
presents the wage data in condensed and compact form, so 
that one consulting the tables may learn of the size and 
distribution of weekly earnings in the factory in question 
much more readily than by reference to the chaotic collec- 
tion of figures first shown. Such organized collections of 
data are termed frequency distributions, and their purpose, 
as the term implies, is to show in a condensed form the na- 
ture of the distribution of a variable quantity throughout 
the range covered by the values of the variable. The con- 
struction of such a table is the first step to be taken in the 
organization and analysis of quantitative data of the type 
represented above. 

STEPS IN THE CONSTRUCTION OF A FREQUENCY TABLE 

This general introduction to the subject of frequency 
tables has left untouched many important matters in con- 
nection with their construction. It remains to present a 
summary statement of these details. It will be clear that 
the first step here taken, the arrangement of the items in 
order of magnitude, is unnecessary in the actual construc- 
tion of such a table. Having determined the upper and 
lower limits through an inspection of the data, one has 
but to decide on the number of classes desired, write the 
class-intervals on an appropriate blank sheet, and proceed 
to tally the cases falling in each of the classes thus set off. 
When this process is completed the frequencies are com- 
puted and the totals arranged in tabular form of the type 
illustrated above. These simple operations involve decisions 
on a number of points, however. 
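The tallying operation just described may be sketched in Python as follows. The earnings are entered in cents, in the haphazard order of the raw table and as read from it; the function name and the choice of $22.00 as the lower limit of the first class are assumptions made for the illustration. No preliminary sorting is required.

```python
from collections import Counter

# Weekly earnings of the 210 employees, in cents, as read from the raw table.
EARNINGS = [
    2625, 2870, 2415, 2975, 2920, 3060, 2340, 2475, 2670, 2435,
    2575, 2720, 2830, 2525, 2775, 2760, 2820, 2730, 2780, 2635,
    2740, 2830, 2660, 2575, 2770, 2860, 2530, 2780, 2640, 2730,
    2835, 2700, 2430, 2780, 2760, 2630, 2740, 2350, 2960, 2780,
    2760, 2535, 2755, 2900, 2410, 2700, 2450, 2725, 2615, 2930,
    2310, 2710, 2850, 2745, 2615, 2835, 2795, 2555, 2755, 2660,
    2425, 3000, 2855, 2800, 2730, 2790, 2525, 2410, 2745, 2455,
    2655, 2755, 2675, 3100, 2400, 2535, 2650, 2830, 2795, 2555,
    3025, 2855, 2675, 2460, 2575, 2655, 2780, 2890, 2955, 3000,
    2460, 2575, 2630, 2700, 2825, 2525, 2575, 2625, 2630, 2675,
    2790, 2830, 2570, 2630, 2660, 2700, 3075, 2860, 2810, 2350,
    2475, 2515, 2630, 2725, 2815, 2910, 3010, 2990, 2855, 2730,
    2655, 2755, 2300, 2450, 2285, 2655, 2755, 2810, 3070, 2860,
    2790, 2680, 2410, 2525, 2630, 2790, 2690, 2530, 2580, 2885,
    2755, 2730, 2500, 2600, 2655, 2780, 2860, 3055, 2950, 2410,
    2515, 2715, 2810, 2630, 2710, 2460, 2780, 2630, 2790, 2980,
    2410, 2515, 2750, 2425, 2570, 2680, 3015, 2930, 2815, 2865,
    2455, 2585, 2610, 2700, 2680, 2755, 2900, 2300, 2860, 2930,
    2855, 2880, 2755, 2360, 2610, 2715, 2575, 2680, 2715, 2630,
    2855, 2580, 2455, 2580, 2675, 2730, 2755, 2825, 2560, 2630,
    2685, 2860, 2730, 2600, 2810, 3200, 2815, 2630, 2775, 2625,
]

def frequency_table(values, lower, width):
    """Tally values into classes of equal width, beginning at `lower`.

    Returns (lower limit, upper limit, frequency) for each class,
    the upper limit being stated to the nearest cent.
    """
    counts = Counter((v - lower) // width for v in values)
    return [(lower + k * width, lower + (k + 1) * width - 1, counts.get(k, 0))
            for k in range(max(counts) + 1)]

for lo, hi, f in frequency_table(EARNINGS, 2200, 200):
    print(f"${lo / 100:.2f} to ${hi / 100:.2f}  {f}")
```

Changing the third argument from 200 to 100 gives the distribution with a class-interval of one dollar.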

SIZE OF CLASS-INTERVAL 

In deciding upon the size of the class-interval (which is 
equivalent to deciding upon the number of classes) one 
fundamental consideration should be borne in mind, namely, 
that classes should be so arranged that there will be no 
material departure from an even distribution of cases within 
each class. This arrangement is necessary because, in inter- 
preting the frequency table and in subsequent calculations 
based upon it, the mid-value of each class is taken to repre- 
sent the values of all cases falling in that class. Thus, in 
basing calculations upon Table 8, it is assumed that the 
22 cases falling between $26.00 and $26.50 may all be repre- 
sented by the mid-value of that class, $26.25. This assump- 
tion will seldom be strictly valid. In the case just cited 
reference to the original figures will show that it is not a 
correct assumption. Absolute accuracy would only be ob- 
tained by having a class for every value represented in the 
original figures. Since condensation is necessary an arrange- 
ment of classes should be secured which will minimize the 
error involved, without transgressing other requirements. 
Table 6 furnishes an example of class-intervals too wide 
for the material. 
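The assumption may be examined for the class cited. The Python sketch below takes the 22 earnings between $26.00 and $26.50 as they are read from the array (in cents; the variable names are hypothetical) and compares them with the mid-value of $26.25.

```python
# The 22 earnings in the class $26.00 to $26.49, in cents, as read from the array.
cases = ([2600] * 2 + [2610] * 2 + [2615] * 2 + [2625] * 3
         + [2630] * 11 + [2635, 2640])

mid_value = 2625                        # mid-value used for the class in the text
class_mean = sum(cases) / len(cases)    # actual average of the 22 cases
at_2630 = cases.count(2630)             # cases at the single value $26.30

print(len(cases), round(class_mean, 2), at_2630)
```

Half of the cases stand at the single value $26.30, so the distribution within the class is far from even. In this instance the class average nevertheless falls within one cent of the mid-value; such agreement cannot be relied upon in general.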

The requirement which has just been described clearly 
calls for a large number of classes. A second requirement, 
which ordinarily conflicts with this, is that the number of 
classes should be so determined that an orderly and regular 
sequence of frequencies is secured. If the classification is 
too narrow for the data regularity will not be attained in 
this respect, and a table without structure or order will be 
secured. Table 9 fails to meet this requirement, as has been 
pointed out. It is desirable, also, that the number of classes 
be limited in order that the data may be easily manipulated 
and their significance readily grasped. 

A useful procedure for approximating a suitable class- 
interval has been suggested by H. A. Sturges. Given a 
series of N items of known range, a suitable class-interval i 
may be approximated from the formula 

$$ i = \frac{\text{Range}}{1 + 3.322 \log N} $$

The specific figure secured in a given instance is likely to 
be a fractional value, quite unsuited to actual use. An 
appropriate round number, close to the theoretical value, 
may be chosen.1 

1 This formula, and the justification for its use, are discussed in "The Choice 

Thus, in the example cited above, with a 
range of $9.15 and N equal to 210, the use of a class-interval 
of $1.05 is indicated by the formula. The nearest round 
number, suitable with reference to other considerations as 
well, is $1.00. Table 7, in which this class-interval is em- 
ployed, seems to conform most thoroughly to all the re- 
quirements we have set forth. 
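The computation for the earnings data may be reproduced in a few lines of Python. The sketch assumes that the logarithm in the formula is taken to the base 10 and that the range is the difference between the largest and smallest earnings, $32.00 less $22.85, or $9.15; the function name is hypothetical.

```python
import math

def sturges_interval(data_range, n):
    """Class-interval suggested by the formula: range / (1 + 3.322 log10 N)."""
    return data_range / (1 + 3.322 * math.log10(n))

classes = 1 + 3.322 * math.log10(210)        # number of classes implied
i = sturges_interval(32.00 - 22.85, 210)     # suggested class-interval, dollars

print(round(classes, 2), round(i, 2))
```

The denominator is the number of classes implied by the formula, between eight and nine for 210 items, and the quotient rounds to the $1.05 cited in the text.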


LOCATION OF CLASS LIMITS 


The location of class limits is a matter of considerable 
importance, for attention to this matter will simplify tab- 
ulation and facilitate later calculation. Tabulation of data 
is easiest when class limits are integers and the class-interval 
itself is a whole number. Calculation of averages and other 
statistical measures is facilitated when the mid-values of 
classes are integers. Suitable class limits and mid-points 
are usually secured when the data permit class-intervals of 5 
or multiples of 5 to be employed, though such an arrange- 
ment is by no means essential. 

Some types of data show a tendency to cluster or con- 
centrate about certain values on the scale along which they 
are distributed. This is illustrated by the following figures 
which form part of a table showing the number of pieces 
of commercial paper discounted by the Federal Reserve 
Banks in 1921, distributed according to rates of discount 
or interest charged by member banks: 

Rate (per cent)      Number of pieces 

6                         18,070 
6 1/4                        607 
6 1/2                      4,616 
6 3/4                 [illegible] 
7                         17,362 
7 1/4                         10 


of a Class Interval," by Herbert A. Sturges, Journal of the American Statistical 
Association, March, 65-6. The use of the formula rests on the assumption 
that the proper distribution into classes is given, for all numbers which are 
powers of 2, by a series of binomial coefficients. The relation of the terms in 
the binomial expansion to the theory of frequency distributions is discussed 
below, in Chapter XIII. 
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Here is a quite obvious bunching about the integers, with 
a secondary concentration at each half of one per cent. 
No cases at all fall between the quarter values here shown. 
It is clear that in classifjdng such data the mid-points of 
the various classes should fall at those values about which 
the cases are concentrated, and class limits must be located 
with this end in view. For, as noted above, calculations 
based upon the frequency table are performed upon the 
assumption that all the items in each class are concentrated 
at the mid-point of that class. Thus, if a class interval of 
one half of one per cent were selected in the above example, 
the classes should extend from 5f to (but not including) 
6i, 6i to 6f, etc., rather than from 6 to bj, to 7, etc. 
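The placing of limits so that mid-points fall on the values of concentration may be sketched as follows. The function name and its arguments are illustrative assumptions; exact fractions are used so that quarter and half values are not blurred by rounding.

```python
from fractions import Fraction

def centered_class_limits(first_mid, interval, n_classes):
    """Return (lower, upper) limits of successive classes whose
    mid-points are first_mid, first_mid + interval, and so on."""
    half = interval / 2
    limits = []
    for k in range(n_classes):
        mid = first_mid + k * interval
        limits.append((mid - half, mid + half))
    return limits

# Mid-points at 6 and 6 1/2 per cent, class-interval of one half of one per cent.
for lower, upper in centered_class_limits(Fraction(6), Fraction(1, 2), 2):
    print(f"{lower} to (but not including) {upper}")
```

The limits come out as 23/4 to 25/4 and 25/4 to 27/4, that is, 5¾ to 6¼ and 6¼ to 6¾, in agreement with the classes named above.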


ACCURACY OF OBSERVATIONS AND THE DEFINITION 
OF CLASSES 

In the construction of frequency tables it is essential 
that there be a clear definition of classes, so that there may 
be no uncertainty as to their range and no question as to
the precise class in which a given case falls. A table with 
an arrangement similar to the following is sometimes en- 
countered: 


Class-interval      Frequency
 0 to 10                 3
10 to 20                 8
20 to 30                15
30 to 40                 6
40 to 50                 2


In the absence of explanation, a question arises at once as 
to whether a case with a value of 10 would fall in the first 
or in the second class. It is highly desirable that the range 
of each class be indicated in some such way as the following, 
in order that this ambiguity may be avoided:

Class-interval      Frequency
 0 to  9.9               3
10 to 19.9               8
20 to 29.9              15
30 to 39.9               6
40 to 49.9               2

This procedure solves the difficulty, however, only in case 
the observations are accurate to the nearest tenth. If the 
observations are accurate only to the nearest unit (that is, 
if the cases recorded as having a value of 10 actually lie 
between 9.5 and 10.5) a mere change in the description of 
the class range does not solve the problem of allocating a 
case at the class limit. In such a case an observation falling
at a class boundary may be cut in two, one half being
allocated to each of the adjacent classes.
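A minimal sketch of this half-and-half allocation follows. The function name, the sample values, and the rule that the extreme limits of the table belong wholly to the end classes are assumptions made for illustration.

```python
def tabulate_splitting_boundaries(values, boundaries):
    """Count values into classes bounded by successive entries of
    `boundaries`. A value lying exactly on an interior boundary is
    cut in two, one half going to each adjacent class."""
    n_classes = len(boundaries) - 1
    counts = [0.0] * n_classes
    for v in values:
        for i in range(n_classes):
            lo, hi = boundaries[i], boundaries[i + 1]
            if lo < v < hi:
                counts[i] += 1.0  # clearly inside the class
            elif v == lo:
                # lowest limit of the table goes wholly to the first class
                counts[i] += 1.0 if i == 0 else 0.5
            elif v == hi:
                # highest limit of the table goes wholly to the last class
                counts[i] += 1.0 if i == n_classes - 1 else 0.5
    return counts

# Hypothetical observations recorded to the nearest unit.
observations = [3, 10, 10, 12, 20, 25]
counts = tabulate_splitting_boundaries(observations, [0, 10, 20, 30])
print(counts)
```

The class-frequencies may then be fractional, but their sum still equals the number of observations.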

Yule lays down the useful principle that in fixing a class 
boundary the limit should be carried to a farther place in 
decimals, or a smaller fraction, than the values of the indi- 
vidual cases as originally recorded. Thus, in the preceding 
example, if observations were correct to the nearest tenth, 
it would mean that a value recorded as 9.9 actually lay
between 9.85 and 9.95. In accurately describing the classes,
therefore, the intervals should be given as 0 to 9.95, 9.95
to 19.95, etc. (Since the observations to be tabulated are
recorded only to the first decimal place no ambiguity arises 
from the apparent over-lapping of these class limits.) It 
should be noted that the values of the mid-points, with 
these class limits, would be 4.95, 14.95, etc. In presenting 
and using the table as given above the real meaning of the 
class limits should be borne in mind. In all cases class 
boundaries must be fixed with reference to the accuracy of 
the observations. 
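The principle may be sketched in a few lines. The sketch assumes that each true boundary lies half a recording unit below the nominal lower limit of its class, and it applies the same shift to the lowest class, which is what yields the mid-point of 4.95 cited above; the function name is hypothetical.

```python
from decimal import Decimal

def true_class_boundaries(nominal_lower_limits, recording_unit):
    """Carry each boundary one place farther than the recorded values:
    half a recording unit below each nominal lower limit (assumed rule)."""
    half_unit = Decimal(recording_unit) / 2
    return [Decimal(x) - half_unit for x in nominal_lower_limits]

# Observations recorded to the nearest tenth; nominal limits 0, 10, 20.
bounds = true_class_boundaries(["0", "10", "20"], "0.1")
mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]
print(bounds)
print(mids)
```

Decimal arithmetic is used so that 9.95 and 14.95 are represented exactly rather than as binary approximations.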

The work of tabulation is simplified if, in designating a
class, both limits are stated, as above. Errors are likely if
only the lower limit of each class is given, or if the
mid-point alone is designated. It is desirable, however,
particularly if calculations are to be based upon the table, to
include a separate column showing the values of the
mid-points of the various classes.

OTHER REQUIREMENTS

Class-intervals should be uniform throughout the table 
in order that all classes may be comparable. Occasionally 
tables are published with varying class-intervals, so that
on one section of the scale the number of items falling 
within a class having an interval of 5 is given, and on 
another section of the scale the number of items falling 
within a class having a range of 10 is given. Obviously, 
comparison of classes is impossible. It may be desirable 
to show in more detail the cases falling within certain 
ranges on the scale, but this end is best achieved by the 
construction of a supplementary table relating only to the 
cases falling within this restricted section. The utility of 
the main table is not lessened thereby. 

Similar in nature is the requirement that there should be 
no indeterminate classes, that is, classes the ranges of which 
are not defined. Had all the individuals making $30.00 and
over in the illustration of piece-work earnings been entered 
in a class with the designation “$30.00 and over,” the 
upper limit of this class would have been quite uncertain. 
This fault in a table is a vital one when it is desired to base 
calculations upon the data contained in the table. When 
there are several extreme cases the inclusion of such classes 
is sometimes unavoidable, but when this is done the actual 
values of the cases included in such “open end” classes 
should be given in a footnote to the table. 

The errors described in the two preceding paragraphs
are exemplified in Table 10.

Table 10

Frequency Distribution of Rented Dwellings in Reno, Nevada, 1934 ¹
(Classified on the basis of rental value)

Monthly rental          Number of dwellings in each class
                        (frequency)
Under $10.00                     327
$10.00 to $14.99                 349
 15.00 to  19.99                 521
 20.00 to  29.99               1,039
 30.00 to  49.99               1,075
 50.00 to  74.99                 189
 75.00 to  99.99                  24
$100.00 and over                   9
Total                          3,533

In this case the ranges of the two "open end" classes are
not known. The ranges of the intermediate classes vary,
being $5.00 for two classes, $10.00 for one class, $20.00
for one class and $25.00 for two classes. The purposes of a
special investigation may sometimes be served by the use
of such a form, but a table of this type is poorly adapted to
the requirements of statistical calculation.
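Why the class-frequencies of such a table cannot be compared directly may be seen by setting each frequency against the width of its class. The sketch below is illustrative only: it treats an upper limit of $14.99 as $15.00, omits the two open-end classes because their ranges are unknown, and uses a hypothetical function name.

```python
# (lower limit, upper limit, frequency) for the closed classes of Table 10.
closed_classes = [
    (10.00, 15.00, 349),
    (15.00, 20.00, 521),
    (20.00, 30.00, 1039),
    (30.00, 50.00, 1075),
    (50.00, 75.00, 189),
    (75.00, 100.00, 24),
]

def frequency_per_dollar(rows):
    """Return (width, frequency per dollar of width) for each class."""
    return [(hi - lo, freq / (hi - lo)) for lo, hi, freq in rows]

per_dollar = frequency_per_dollar(closed_classes)
for (lo, hi, freq), (width, density) in zip(closed_classes, per_dollar):
    print(f"${lo:6.2f} to ${hi:6.2f}  width ${width:5.2f}  "
          f"frequency {freq:5d}  per dollar {density:7.2f}")
```

The class from $30.00 to $49.99 has the largest frequency in the table, yet per dollar of rental it holds about half as many dwellings as the class from $20.00 to $29.99.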

¹ The table is taken from Real Property Inventory, 1934. Summary and Sixty-
Four Cities Combined, Department of Commerce, Washington. Figures for
255 rented dwellings in Reno were not reported.

The Structure of Statistical Tables

The preceding discussion has been confined to certain
more or less technical problems which arise in the
construction of a frequency table. Nothing has been said directly
as to the form of the completed table, the arrangement of
columns and rows, the title, the notation. No general
principles of tabular arrangement have been laid down. While
no detailed treatment of these principles is possible within
the scope of the present discussion, certain general
considerations relating to the structure of statistical tables may be
suggested.

The statistical table is merely a device for presenting in
summary fashion a mass of quantitative data. Unless the
summary be clear, significant, concise, and readily
interpreted nothing has been gained by the process of tabulation
and classification. A sprawling, formless table is like a
rambling, unintelligible discourse. There must be a purpose
in back of each table, and this purpose should be clearly
brought out in its arrangement. The means by which this
purpose may be attained in a given case must be
determined with reference to the specific conditions affecting that
case, but standard practices should be followed, in so far
as possible. The following general principles will be found
helpful in deciding upon the form and arrangement of
statistical tables:

1. The title should constitute a clear, concise and complete
   description of the material assembled in the table.
2. Headings of columns and rows should be concise and
   unambiguous.
3. Variable quantities should increase from left to right and
   from top to bottom, when such arrangement is feasible.
4. Columns and rows may be numbered to facilitate reference to
   the table.
5. The units of measurement employed should be clearly indicated.
6. Sources should be given in all cases.
7. The table should constitute a unit, self-sufficient and
   self-explanatory. All explanations necessary for its
   interpretation should be included as integral parts of the
   table, or in the form of footnotes.

Graphic Representation of Frequency Distributions

Frequency distributions of the type illustrated above serve
a very important statistical function in presenting a compact
summary of data, and in preparing these data for further
manipulation. Such distributions may be presented not
only in tabular form, but graphically, utilizing the general
principles of the coordinate system which were explained
above. Many of the characteristic features of a frequency
distribution are most clearly revealed when the graphic
method is adopted.

Table 6, presenting the weekly earnings of 210 employees,
with a class-interval of two dollars, is depicted graphically
in Fig. 23. In this figure class-intervals are plotted along
the x-axis and the corresponding class-frequencies along the
y-axis, appropriate scales being selected. The fact should
be noted that the scale of abscissas starts not with zero,
but with $20. For convenience in presentation that part
of the scale extending from 0 to $20 is omitted. The student
should bear this in mind in seeking to secure a correct
impression of the relations between the two variables plotted.
In constructing such a figure, which is termed a column
diagram or histogram, short horizontal lines are drawn
connecting the points plotted to represent the upper and lower
limits of each class-interval. In interpreting this diagram
it should be noted that the areas of the different rectangles
are proportional to the number of cases represented, the
total area representing the entire 210 cases. This device
thus presents to the eye a very clear picture of the
distribution, showing quite unmistakably the relative number of
workers falling in each of the wage classes.

The classes in this case are so large, however, that some
violence is done to the facts. So many details are lost
that a true conception of the disposition of the items is not
given. Fig. 24 is a histogram depicting the distribution of
cases when a class-interval of one dollar is used. In this
case, with smaller steps, we approach more closely an orderly
and symmetrical distribution. The same is true of Fig. 25
which shows the distribution when the class-interval is
fifty cents. The distribution represented in Fig. 26 has a
class-interval of twenty-five cents which, as has been pointed
out, is too narrow for the data, with the result that a quite
irregular structure is secured. (It should be noted that
the vertical scale is not the same in these four figures, so
that comparison with respect to class-frequencies is only
possible by reference to the scale figures.)

Fig. 24. Column Diagram: Distribution of 210 Employees Classified
on the Basis of Weekly Earnings (Class-interval = $1.00)




Fig. 25. Column Diagram: Distribution of 210 Employees Classified
on the Basis of Weekly Earnings (Class-interval = $.50)

Fig. 26. Column Diagram: Distribution of 210 Employees Classified
on the Basis of Weekly Earnings (Class-interval = $.25)

Fig. 27. Frequency Polygon: Distribution of 210 Employees Classified
on the Basis of Weekly Earnings (Class-interval = $2.00)

Fig. 28. Frequency Polygon: Distribution of 210 Employees Classified
on the Basis of Weekly Earnings (Class-interval = $1.00)

Frequency polygons corresponding to the histograms of
Figs. 23, 24, and 25 are shown in Figs. 27, 28, and 29.
Each of these polygons has been constructed by plotting as
abscissas the mid-points of the class-intervals, and as
ordinates the class-frequencies, the points thus secured being
connected by a broken line. In completing such a figure
the class next below the lowest one on the scale and the
class next above the highest one on the scale are included,
the class-frequency being zero in each case. The ends of
the polygon thus connect with the base line at the
mid-points of these two extra classes. In the case of the
frequency polygon the entire area under the curve represents
the entire number of cases, but the area of a given interval
cannot be taken to be proportional to the number of cases
in that interval, because of irregularities in the distribution
on either side of the given class. The heights of the
ordinates at the mid-points of the various classes are, of course,
scaled to represent the class-frequencies.

Fig. 29. Frequency Polygon: Distribution of 210 Employees Classified
on the Basis of Weekly Earnings (Class-interval = $.50)
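The construction just described may be sketched as follows. Since Table 6 is not reproduced here, the sketch borrows the small table of five classes (0 to 10, 10 to 20, and so on) given earlier in the chapter; the function names are illustrative assumptions.

```python
def polygon_points(lower_limits, interval, frequencies):
    """Vertices of a frequency polygon: class mid-points as abscissas,
    class-frequencies as ordinates, with one extra class of zero
    frequency added below the lowest and above the highest class."""
    mids = [lo + interval / 2 for lo in lower_limits]
    first = (mids[0] - interval, 0)
    last = (mids[-1] + interval, 0)
    return [first] + list(zip(mids, frequencies)) + [last]

def area_under(points):
    """Area between the broken line and the base line (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

points = polygon_points([0, 10, 20, 30, 40], 10, [3, 8, 15, 6, 2])
print(points)
print(area_under(points))
```

The area under the completed polygon is 340, which is the class-interval of 10 multiplied by the 34 cases, so the entire area does represent the entire number of cases.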

The Smoothing of Curves

Attention is again called to the results secured with
varying class-intervals. As the class-interval is decreased,
up to a certain point, the histograms and polygons become
smoother and more regular. Beyond that point breaks
begin to appear in the data; the regular change in
class-frequencies which was found when the classes were larger
is broken by the appearance of irregular classes which
seem to depart from the general rule. In Fig. 25 these
have become quite pronounced. Such irregularities, it is
obvious, are exceptions to a general rule which seems to
prevail, the general rule that the numbers of workers falling
within the different wage classes increase from the lower
limit of earnings up to a maximum in the neighborhood of
$27.50, and then decrease till, at the upper limit of $32,
but one worker is found. Since all the 210 individuals are
engaged in the same work, and since their earnings depend
only upon their rapidity and skill, one would expect a quite
regular increase and decrease. If we had figures not for
one week only, but for 52 weeks, and took the average
weekly earnings of each of the 210 workers for the year,
we should expect greater regularity with the smaller
class-intervals than is actually found, since the accidental
fluctuations peculiar to one week alone would thus be
eliminated. Or, if we had earnings during one week for 10,920
workers (52 times 210), the same result would be secured.
Thus, if regularity and smoothness are to be secured, it is
essential not only to decrease the size of the classes but
also to increase the number of cases, in order that the
accidental irregularities which affect a small number of
observations may be eliminated. A refined classification
with a small number of cases leads to the condition
exemplified in Fig. 26. But such an increase in the number of cases
is, in general, a practical impossibility. We wish, if
possible, to develop a feasible method of approximating the
distribution which would be secured with very small
class-intervals and a very large number of cases. Such an
approximation is possible through the device of curve-smoothing.
By this method we may secure a smooth frequency curve
which lacks the irregularities occasioned by minor
fluctuations.

Such a smooth frequency curve serves to represent the
true underlying distribution of the data. It was pointed
out that areas in the frequency polygon are not
proportional to the number of cases included, the cause lying in
the irregularities of the data. In a smoothed frequency
curve these irregularities have been eliminated, and the
area between ordinates erected at given points on the scale
of abscissas is assumed to be proportional to the theoretical
frequency of cases between the given values. Moreover, a
smooth trend having been established, frequencies for
intermediate values not shown in the original table may be
determined by interpolation.*

The following data,† representing the distribution in 1918
of personal incomes below $4,000, will serve to exemplify
the smoothing process.

* The limitations of practical statistical work are such that there must of
necessity be many gaps in the data. The given values of the variable are not
continuous. Interpolation is the process of estimating values of a variable
quantity between given values, or of locating a point on a curve between given
points. That interpolation is most accurate which leads to estimated values
having the highest degree of consistency with the given values.

† From Vol. I, Income in the United States, National Bureau of Economic
Research. New York, Harcourt, Brace & Co., 1921, 132-33.

Table 11

Distribution of Income among Personal Income Recipients in 1918
(Including all personal incomes below $4,000)

Income class ¹            Number of persons ²
$    0 to $  100                 62,809
   100 to    200                103,704
   200 to    300                209,087
   300 to    400                489,963
   400 to    500                961,991
   500 to    600              1,549,974
   600 to    700              2,154,474
   700 to    800              2,[illegible]88,466
   800 to    900              3,013,034
   900 to  1,000              3,144,722
 1,000 to  1,100              3,074,351
 1,100 to  1,200              2,850,626
 1,200 to  1,300              2,535,286
 1,300 to  1,400              2,205,728
 1,400 to  1,500              1,832,230
 1,500 to  1,600              1,612,649
 1,600 to  1,700              1,234,397
 1,700 to  1,800                999,996
 1,800 to  1,900                811,236
 1,900 to  2,000                663,789
 2,000 to  2,100                549,787
 2,100 to  2,200                463,222
 2,200 to  2,300                395,115
 2,300 to  2,400                340,141
 2,400 to  2,500                295,490
 2,500 to  2,600                268,660
 2,600 to  2,700                227,731
 2,700 to  2,800                201,488
 2,800 to  2,900                178,901
 2,900 to  3,000                154,499
 3,000 to  3,100                142,802
 3,100 to  3,200                128,217
 3,200 to  3,300                115,683
 3,300 to  3,400                104,504
 3,400 to  3,500                 94,803
 3,500 to  3,600                 86,405
 3,600 to  3,700                 79,023
 3,700 to  3,800                 72,562
 3,800 to  3,900                 66,900
 3,900 to  4,000                 61,894

¹ The definition of classes used is equivalent to "$0 to and not including
$100," etc. Thus an individual with an income of $100 would fall in the second
class.

² The National Bureau's report states "The numbers below are given to the
nearest unit. It is not pretended that such arithmetic accuracy is anything
more than technical."


Figures 30, 31, and 32 present column diagrams of these
income data, grouped with class-intervals of $500, $200, and
$100. As the class-interval is decreased the histograms
become more regular and uniform, but our original data
permit us to carry this process only to the point where the
class-interval is $100. Our problem is to determine the
underlying distribution which the data approximate more
and more closely as the class-interval is lessened. If we
replace the broken line of the histogram by a smooth curve
enclosing the same total area as the histogram and so drawn
through the points of the histogram that the area cut from
each rectangle is approximately equal to the area added to
the same rectangle by the curve, we will have a frequency
curve representing the desired distribution. The
requirement that the same total area be enclosed is fundamental.
Exceptions to the rule concerning the area of individual
rectangles will frequently occur because of the existence of
quite irregular classes, but as a general working principle
it is helpful. (More refined methods of fitting a smooth
curve to data will be discussed at a later point, but a process
of smoothing by inspection such as that described above
gives a fairly close approximation to the required curve.)

Fig. 30. Column Diagram: Distribution of Personal Income Recipients
in the United States, 1918. Including all Recipients of Incomes below
$4,000 (Class-interval = $500)

Fig. 31. Column Diagram: Distribution of Personal Income Recipients
in the United States, 1918. Including all Recipients of Incomes below
$4,000 (Class-interval = $200)

Fig. 32. Column Diagram: Distribution of Personal Income Recipients
in the United States, 1918. Including all Recipients of Incomes below
$4,000 (Class-interval = $100)

Fig. 33. Frequency Curve: Distribution of Personal Income Recipients
in the United States, 1918. Including all Recipients of Incomes below
$4,000 (Derived from the column diagram with class-interval of $100)
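The regrouping that lies behind Figs. 30 to 32 may be sketched as follows. Only the first six $100 classes of Table 11 are used, and the function name is a hypothetical one chosen for illustration. The point of the check at the end is that the total number of cases is the same under every grouping, which corresponds to the requirement that the total area be preserved.

```python
def regroup(frequencies, k):
    """Combine each run of k adjacent classes into one wider class."""
    if len(frequencies) % k != 0:
        raise ValueError("number of classes must be a multiple of k")
    return [sum(frequencies[i:i + k]) for i in range(0, len(frequencies), k)]

# First six $100 classes of Table 11 (incomes from $0 up to $600).
first_six = [62_809, 103_704, 209_087, 489_963, 961_991, 1_549_974]

by_200 = regroup(first_six, 2)   # class-interval of $200
by_300 = regroup(first_six, 3)   # class-interval of $300
print(by_200)
print(by_300)

# Regrouping changes the detail shown, never the total number of cases.
assert sum(by_200) == sum(by_300) == sum(first_six)
```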

Figure 33 illustrates the result of smoothing the
histogram of income distribution shown in Fig. 32. Here the
quite artificial jumps between income classes are smoothed
out, and we secure the graduation by infinitesimal
increments which we should expect to find when the incomes of
so many millions of persons are included. Here we have
that which we desired: an approximation to the true
underlying distribution, with the sharp breaks resulting from
the method of classification eliminated.

CONTINUOUS AND DISCRETE SERIES

The logical validity of the smoothing process is dependent
upon the nature of the data being manipulated. From
this point of view frequency series of the type discussed
above may be divided into two classes, continuous series
and non-continuous or discrete series. A continuous series is
one in which the values of the independent variable
increase or decrease by increments which are infinitely small.
A discrete series is one in which the phenomena represented
by the independent variable always change in value by
definite amounts. The curve of underlying values rises not
smoothly, as for the continuous series, but by jumps.

The fact should be emphasized that in making this dis- 
tinction we are speaking of the values as they would be found 
in the underlying universe of phenomena from which the ac- 
tual bodies of material we study are drawn. Any given 
sample, whether representing continuous or discrete series, 
will be marked by breaks in the values of the independent 
variable. This will be true, in the case of a continuous 
series, because of the limitations upon the instruments and 
senses we use in measuring. Thus if the heights of 100 men 
be measured, the independent variable of the frequency 
series (height) will increase by finite amounts. We may 
measure to the nearest inch, or perhaps to the nearest 
eighth of an inch. Yet if ten thousand or ten million men 
were arranged in order of height the differences between 
successive individuals would be infinitely small. Height is a 
continuous variable, even though the values found in a given 
sample are marked by discontinuity. 

Quite different is the distribution of such a variable as
interest or discount rates. If one were to secure 100 such
quotations and rank them in the order of size the
variations would be discontinuous, as in the sample of men
whose heights were measured. But in the case of heights
the underlying values, if they could be determined for a
large population, would be marked by continuous
variation, whereas, were an infinite number of discount rate
quotations secured, there would still be breaks in the
sequence. Discount rates increase or decrease by one quarter
or one half of one per cent, not by infinitesimal amounts.
Such a series is termed discrete, or non-continuous.

The smoothing process provides a means of securing an
approximation to the distribution of values as they would
be found if a sample could be increased indefinitely in size.
It is based upon the assumption that the irregularities
found in the sample actually studied are accidental, and
that the underlying values would show continuous and
unbroken variation. Obviously, therefore, it is only fully
justified when applied to a continuous series. A histogram
of human heights may be smoothed in order to secure a
representation of the true underlying distribution in the
population at large, and interpolation based upon this smoothing
process is valid. But smoothing is quite illogical for a
markedly discontinuous series. It would be meaningless to
construct a smooth curve showing the distribution of
discount rates for the purpose of securing the theoretical
frequency of a rate of 4.3675 per cent. In practical statistical
work, however, it is frequently helpful to handle discrete
series as though they were continuous, and in those cases
the smoothing device may be employed. But in the
interpretation and use of the smoothed curve the important
logical distinction between continuous and discontinuous
variation should be kept clearly in mind.

Cumulative Arrangement of Statistical Data

For certain purposes it is desirable to arrange data
cumulatively, rather than in separate and exclusive classes of
the type illustrated in the frequency tables presented above.
The following material will illustrate some of the advantages
of this arrangement.

In a study of the durability of telephone poles ¹ these
results were secured:

Table 12

Frequency Distribution of 248,707 Telephone Poles, Classified
According to Length of Life

Length of life          Number of poles
(years)                 (frequency)
 0- 0.9                      1,150
 1- 1.9                      4,221
 2- 2.9                     10,692
 3- 3.9                     13,966
 4- 4.9                     16,633
 5- 5.9                     18,211
 6- 6.9                     19,011
 7- 7.9                     19,260
 8- 8.9                     20,909
 9- 9.9                     19,879
10-10.9                     20,764
11-11.9                     15,454
12-12.9                     14,237
13-13.9                     13,779
14-14.9                      9,764
15-15.9                      8,534
16-16.9                      7,659
17-17.9                      6,918
18-18.9                      4,591
19-19.9                      1,798
20-20.9                        815
21-21.9                        313
22-22.9                        102
23-23.9                         47

¹ "Replacement Insurance," Edwin Kurtz. Administration, July, 1921,
41-69.

The table shows that 1,150 poles were scrapped during
the first year of use, that 4,221 were scrapped after reaching
the age of one year and before reaching the age of two
years, and so on. This is simply a frequency table of the
ordinary type. A much more significant arrangement for
many purposes is secured when the figures are assembled
cumulatively, as in the following table.

Table 13

Cumulative Distribution of 248,707 Telephone Poles, Classified
According to Length of Life
(Cumulated upward)

Length of life          Number of poles surviving
                        (frequency)
Less than  1 year              1,150
Less than  2 years             5,371
Less than  3 years            16,063
Less than  4 years            30,029
Less than  5 years            46,662
Less than  6 years            64,873
Less than  7 years            83,884
Less than  8 years           103,144
Less than  9 years           124,053
Less than 10 years           143,932
Less than 11 years           164,696
Less than 12 years           180,150
Less than 13 years           194,387
Less than 14 years           208,166
Less than 15 years           217,930
Less than 16 years           226,464
Less than 17 years           234,123
Less than 18 years           241,041
Less than 19 years           245,632
Less than 20 years           247,430
Less than 21 years           248,245
Less than 22 years           248,558
Less than 23 years           248,660
Less than 24 years           248,707
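The upward cumulation of Table 13 is a running total of the class-frequencies of Table 12, and may be reproduced in a few lines. The variable names are illustrative only.

```python
from itertools import accumulate

# Class-frequencies of Table 12: lives of 0-0.9, 1-1.9, ..., 23-23.9 years.
frequencies = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

# less_than[k] is the number of poles with a life of less than k + 1 years.
less_than = list(accumulate(frequencies))

for years, count in enumerate(less_than, start=1):
    print(f"Less than {years:2d} years  {count:7,d}")
```

The last entry of the running total must equal the total number of poles, which provides a ready check upon the tabulation.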


It is important to note that it is possible to cumulate a
frequency series in two different ways. From the above
table we may determine readily the number failing to
attain any given age. It is often more convenient to reverse
the process, so that the table will enable the total number
above any given value to be immediately determined. When
the telephone pole figures are thus cumulated downward
the following table is secured.


Table 14

Cumulative Distribution of 248,707 Telephone Poles, Classified
According to Length of Life
(Cumulated downward)

(1)                     (2)                          (3)
Length of life          Number of poles surviving    Per cent
                        (frequency)
 0 and more                    248,707               100.0
 1 year and more               247,557                99.5
 2 years and more              243,336                97.8
 3 years and more              232,644                93.6
 4 years and more              218,678                88.0
 5 years and more              202,045                81.2
 6 years and more              183,834                73.8
 7 years and more              164,823                66.3
 8 years and more              145,563                58.5
 9 years and more              124,654                50.1
10 years and more              104,775                42.1
11 years and more               84,011                33.8
12 years and more               68,557                27.6
13 years and more               54,320                21.8
14 years and more               40,541                16.3
15 years and more               30,777                12.4
16 years and more               22,243                 8.9
17 years and more               14,584                 5.9
18 years and more                7,666                 3.1
19 years and more                3,075                 1.2
20 years and more                1,277                 0.5
21 years and more                  462                 0.2
22 years and more                  149                 0.06
23 years and more                   47                 0.02
24 years and more                    0                 0.00


Cumulative tables such as those given above have
distinct advantages in the handling of many types of data. Life
tables are generally presented in this form. The scientific
study of depreciation will lead to the construction of
elaborate "mortality tables" for various types of equipment,
and these will be most useful in the cumulative form. It
is frequently desirable to reduce the frequencies to
percentages, as in column (3) of Table 14, though it should
not be forgotten that the significance of the percentages
depends upon the absolute numbers upon which they are
based.
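The downward cumulation and the reduction to percentages may be sketched in the same manner. The function name is hypothetical, and the percentages are computed afresh from the frequencies, so that they may differ in the last printed figure from those of column (3).

```python
def cumulate_downward(frequencies):
    """Number of cases at or above the lower limit of each class,
    followed by the number above the upper limit of the last class."""
    surviving = []
    remaining = sum(frequencies)
    for f in frequencies:
        surviving.append(remaining)
        remaining -= f
    surviving.append(remaining)
    return surviving

# Class-frequencies of Table 12: lives of 0-0.9, 1-1.9, ..., 23-23.9 years.
frequencies = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

surviving = cumulate_downward(frequencies)
total = surviving[0]
per_cent = [100 * s / total for s in surviving]

for years, (s, p) in enumerate(zip(surviving, per_cent)):
    print(f"{years:2d} years and more  {s:7,d}  {p:6.2f}")
```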

THE OGIVE, OR CUMULATIVE FREQUENCY CURVE

The general utility of such cumulated data is limited by
the classification system necessarily adopted in condensing
the material. Unless we interpolate mathematically we are
limited to the points on the scale actually noted in the two
tables. For this reason, a generalized cumulative curve
similar to the smoothed frequency curve described in the
preceding section is desirable. If the values given in Table 13
be plotted on coordinate paper (the length of life in each
case as abscissa, and the corresponding number of poles as
ordinate) and a smooth curve drawn through the points
thus plotted, the cumulative frequency curve shown in
Fig. 34 is secured. In Fig. 35 the data of Table 14 are
plotted.

Fig. 34. Cumulative Frequency Curve: Distribution of Telephone Poles
Classified according to Length of Life (Cumulated upward)

Such a curve constitutes one of the most effective and
useful representations of a frequency series. It is obvious
that the limitations of the particular class-interval adopted
are in large part removed; the shape of the curve will be
fundamentally the same, though the class-interval and
number of classes may vary. Frequency curves of the usual
type may not be compared unless the groupings are the
same, but cumulative frequency curves are subject to no
such restriction. Moreover, uneven class-intervals do not
distort the ogive, or cumulative curve, as they do the ordinary
frequency curve.

Fig. 35. Cumulative Frequency Curve: Distribution of Telephone Poles
Classified according to Length of Life (Cumulated downward)

The cumulative curve is particularly well adapted to
interpolation. Thus if it is desired to know the number of
poles surviving less than 15½ years, the value of the
ordinate of the curve having 15½ as abscissa may be
approximated from Fig. 34. A value of 222,000 is secured. If the
number surviving 8½ years or more is desired, a similar
estimate may be made from Fig. 35. The interpolated
figure in this case is 135,000.

Another type of interpolation possible with such a curve
is the determination of the number of cases falling within
any given interval. One is not limited to the class-intervals
marked out in the original tables. For instance, it may be
desirable to know the number of poles surviving more than
10½ but less than 15 years. Reading from the table or from
the chart we find that 217,930 poles survived less than
15 years. Interpolating on the chart in the manner described
above, a figure of 154,000 is secured for the number
surviving less than 10½ years. Subtracting the latter figure
from the former we have 63,930 as the number of poles
falling within the 10½ to 15 years interval. The figure is,
of course, an approximation to the true value, as are all
values secured through such smoothing and interpolation.

It should be noted that the ogive may be derived directly 
from the array, without the formation of a frequency table 
as an intermediate step. This curve, in fact, may be looked 
upon as merely a graphic representation of the array. It 
represents one of the simplest forms of statistical organi- 
zation, as well as one of the most effective methods of 
manipulating quantitative data. 

RELATION BETWEEN THE OGIVE AND THE FREQUENCY CURVE

The ogive and the frequency curve are merely two different
arrangements of precisely the same material, each
arrangement having certain distinctive advantages. The
characteristics of each may be more clearly apparent if the
structural relationship between these two curves is understood.
This relationship is graphically portrayed in Fig. 36.¹

¹ The suggestive arrangement shown in this figure was originated by Robert
E. Chaddock.

Fig. 36. Distribution of Sawmills in the United States Classified
according to Labor Cost in 1921. Illustrating the Structural Relation
between the Ogive and the Frequency Curve.
(Horizontal axis: labor cost per 1,000 feet.)

This figure is based upon the following frequency table,
showing the distribution of sawmills in the United States,
classified on the basis of labor cost per 1,000 feet of lumber
produced.¹


Table 15

Frequency Distribution of 269 Sawmills in the United States Classified
According to Labor Cost in 1921

Labor cost (all employees) per     Number of establishments
1,000 feet, board measure          (frequency)

$1.00-$1.49                              3
 1.50- 1.99                             10
 2.00- 2.49                             14
 2.50- 2.99                             22
 3.00- 3.49                             38
 3.50- 3.99                             40
 4.00- 4.49                             38
 4.50- 4.99                             33
 5.00- 5.49                             20
 5.50- 5.99                             11
 6.00- 6.49                             10
 6.50- 6.99                             11
 7.00- 7.49                              8
 7.50- 7.99                              4
 8.00- 8.49                              4
 8.50- 8.99                              3
                                       ---
Total                                  269


The upper part of Fig. 36 indicates the method by which
the ogive is built up. Just as in the histogram, the area of
each rectangle is proportional to the number of cases falling
in the given class. Since the operation is a cumulative
one, however, the base of each rectangle is the cumulated
frequencies of all preceding classes. Thus the y-value (frequency)
of the first rectangle is 3, erected from zero as a
base, the y-value of the second class is 10, erected from 3
as a base, and so on. The slope of the curve connecting
these rectangles is gradual at first when the frequencies
are low, then steeper as the frequencies become greater,
and finally tapers off as the frequencies decrease near the
upper limit of the distribution. This is the cumulative
frequency curve, or ogive.

¹ From "Labor Efficiency and Productiveness in Sawmills," Ethelbert
Stewart, Monthly Labor Review, January, 1923, 14. Seven scattered cases
above $9.00 in value have been omitted from the table and the accompanying
graph.

When the various rectangles representing the class-frequencies
are dropped to the zero line as a common base,
the x-values remaining the same throughout, the histogram
or column diagram described in an earlier section is secured.
From this the frequency polygon or smoothed frequency
curve may be derived.
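
The building up of the ogive from the class frequencies may be sketched in a few lines of Python, using the frequencies of Table 15. This is an illustration only: the names `lower_limits`, `cumulative` and `mills_below` are hypothetical, the class boundaries are assumed to fall at $1.00, $1.50, and so on, and straight-line interpolation within each class is substituted for the smooth freehand curve described in the text.

```python
from itertools import accumulate

# Table 15: lower class limits (dollars per 1,000 feet) and class frequencies.
lower_limits = [1.00 + 0.50 * i for i in range(16)]
frequencies = [3, 10, 14, 22, 38, 40, 38, 33, 20, 11, 10, 11, 8, 4, 4, 3]
class_width = 0.50

# Upper boundary of each class, assumed equal to the next lower limit.
upper_bounds = [limit + class_width for limit in lower_limits]

# Cumulated upward: each class frequency is erected on the total of all
# preceding classes, giving the ordinates of the ogive.
cumulative = list(accumulate(frequencies))


def mills_below(cost):
    """Estimate the number of mills with labor cost below `cost`,
    interpolating on a straight line within the class containing it."""
    if cost <= lower_limits[0]:
        return 0.0
    if cost >= upper_bounds[-1]:
        return float(cumulative[-1])
    for i, upper in enumerate(upper_bounds):
        if cost <= upper:
            below = cumulative[i - 1] if i > 0 else 0
            fraction = (cost - lower_limits[i]) / class_width
            return below + fraction * frequencies[i]


# Number of mills falling within an interval not marked out in the table.
between = mills_below(4.00) - mills_below(3.25)
print(cumulative)
print(between)
```

The subtraction in the last step follows the procedure used above for the telephone poles: the number of cases within any interval is the difference between two readings of the cumulative curve.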
CHAPTER IV


DESCRIPTION OF THE FREQUENCY 
DISTRIBUTION: AVERAGES 

The classification of quantitative data and the construc- 
tion of a frequency distribution constitute an important 
stage in the task of organization and analysis. By means 
of classification the underlying structure of the data may
be revealed and the essential unity of a mass of material 
may be brought out. But this is only the first step in statis- 
tical analysis. It remains to develop methods of measuring 
and expressing more concisely the significant characteristics 
of a body of data. For certain purposes the frequency dis- 
tribution itself must be summarized and condensed, must be 
boiled down until its essence has been distilled into three or 
four significant figures. 

If each frequency distribution constituted a novel and 
unique problem, obeying a law peculiar to itself, the task 
of studying and describing such distributions would be a 
difficult one. Fortunately this is not so. Quantitative 
data in widely different fields, when assembled in frequency 
distributions, show certain common characteristics, obey 
certain general laws. Experience in one field, therefore, 
constitutes a guide to work in others. Uniformity in the
behavior of masses of data makes possible the development of 
a generalized method of organizing, analyzing and comparing 
measurements drawn from many fields of scientific study. 

Comparison of Frequency Distributions

This fact of a common law of arrangement running through 
the universe of quantitative facts may be brought home 
most effectively by a comparison of distributions illustrative 
of various types of data. The characteristics of the frequency
distributions and of the frequency curves which follow should 
be noted, and the distributions compared. 



Fig. 37. — Frequency Curve: Distribution of 18,780 Soldiers Classified 
according to Height 


The curve in Fig. 37 is based upon the following data
relating to the heights of 18,780 soldiers.¹

Table 16

Distribution of Soldiers Classified According to Height

Height in inches   Number of soldiers   Height in inches   Number of soldiers

60 +                      197           67 +                    3,017
61 +                      317           68 +                    2,287
62 +                      692           69 +                    1,599
63 +                    1,289           70 +                      878
64 +                    1,961           71 +                      520
65 +                    2,613           72 +                      262
66 +                    2,974           73 +                      174

                                        Total                  18,780

¹ From G. C. Whipple, Vital Statistics, New York, Wiley, 1919, 377.
Fig. 38 depicts a frequency curve based upon 1,000 observations,
made at Greenwich, of the Right Ascension of
Polaris.¹ The values on the abscissa define deviations, in
seconds of time, from an origin near the mean of all the
observations. Frequencies of occurrence of given values on
the x-scale are measured, of course, as ordinates on the
y-scale. The distribution plotted in Fig. 38 is given in
Table 17.

Fig. 38. Frequency Curve: Distribution of Errors of Observation in
Astronomical Measurements
(Horizontal axis: magnitude of deviation in seconds of time, from -3.5 to +3.5.)

¹ E. T. Whittaker and G. Robinson, The Calculus of Observations,
London, Blackie and Son, 1924, 174.

Table 17

Distribution of Errors of Observation in Astronomical Measurements
(1,000 observations of the Right Ascension of Polaris)

Magnitude of deviation,            Number of observations
in seconds of time, from origin

     -3.5                                 2
     -3.0                                12
     -2.5                                25
     -2.0                                43
     -1.5                                74
     -1.0                               126
     -0.5                               150
      0                                 168
      0.5                               148
      1.0                               129
      1.5                                78
      2.0                                33
      2.5                                10
      3.0                                 2
                                      -----
     Total                            1,000

If a piece of artillery be accurately adjusted on a given
target (a point) and 100 shots be fired, it will be found
that the points of impact of the hundred shots will be dispersed
about the target. No matter how accurate the piece
or the adjustment only a small percentage of the shots
will fall upon the exact point at which they were directed.
The points of impact will be scattered about the target
in a quite regular fashion, however. If a rectangle be so
drawn as to include all the points of impact, and this rectangle
(or zone of dispersion) be divided into eight equal
parts, the distribution of shots within these sections will be
as indicated in Fig. 39. (In any given case there are likely
to be slight departures from this order, but in the long run
this distribution will prevail.)

     2     7    16    25    25    16     7     2

Fig. 39. Zone of Dispersion, Artillery Firing, Showing the Theoretical
Percentage Distribution of Shots

This general rule holds for all classes of guns. The more 
accurate the gun the smaller will be the zone of dispersion, 
but the distribution within this zone is theoretically the 
same in all cases. Rules of fire used in artillery adjustment 
are based upon this fact. 
The results of actual firing may be contrasted with this
theoretical distribution. Table 18 presents a record of one
thousand shots fired from a battery gun at the middle of a
stationary target two hundred yards distant.¹ The target
was divided by horizontal lines into eleven equal divisions.


Table 18

Distribution of One Thousand Shots from a Single Gun

Division          Number of shots recorded

 1 (top)                    1
 2                          4
 3                         10
 4                         89
 5                        190
 6                        212
 7                        204
 8                        193
 9                         79
10                         16
11 (bottom)                 2
                        -----
Total                   1,000


These results are presented graphically in Fig. 40. 

The zone of dispersion being divided into eleven divisions 
instead of the eight referred to in describing the theoretical 
distribution, a direct comparison cannot be made. We 
have here, however, the same general type of distribution 
found in the other examples given. A tendency toward 
concentration in the lower half of the target reflects a slight
departure from symmetry. 
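
The departure from symmetry just noted may be checked by a simple count. The following Python sketch is an illustration, not part of the original record: it sums the shots of Table 18 falling above and below the middle division (division 6). The variable names are hypothetical.

```python
# Table 18: shots recorded in divisions 1 (top) through 11 (bottom).
shots = [1, 4, 10, 89, 190, 212, 204, 193, 79, 16, 2]

middle = len(shots) // 2               # index of division 6, the middle one
upper_half = sum(shots[:middle])       # divisions 1 to 5
lower_half = sum(shots[middle + 1:])   # divisions 7 to 11

print(upper_half, shots[middle], lower_half)
```

On these figures 294 shots fall above the middle division and 494 below it, which is the concentration in the lower half referred to in the text.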

¹ This experiment is recorded in the Report of the Chief of Ordnance, 1878,
Appendix 8. The results are given in The Method of Least Squares, Mansfield
Merriman, New York, Wiley, 1897, 14.

Fig. 40. Column Diagram: Distribution of 1,000 Shots from a Single Gun
(Horizontal axis: divisions.)

When coins are tossed the distribution of heads and tails
is assumed to be determined by pure chance. In a single
experiment ten coins were tossed 100 times. The following
table shows the frequencies with which given numbers of
heads appeared. (The greatest number of heads possible
in a given throw under such conditions is, of course, 10;
it is also possible that no heads should appear.)


Table 19

Distribution of Results in Coin Tossing Experiment
(Ten coins tossed 100 times)

Number of heads     Frequency of occurrence

      10                      0
       9                      1
       8                      4
       7                      7
       6                     23
       5                     30
       4                     20
       3                      9
       2                      5
       1                      1
       0                      0
                            ---
     Total                  100
Figure 41 depicts the above frequency distribution.
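
The observed frequencies of Table 19 may be set beside those which pure chance would lead one to expect. The Python sketch below is an added illustration; it assumes that each coin is fair and that the coins fall independently, so that the expected frequencies follow the binomial law. Neither the assumption nor the variable names are taken from the original experiment.

```python
from math import comb

# Table 19: number of throws (out of 100) showing each number of heads.
observed = {10: 0, 9: 1, 8: 4, 7: 7, 6: 23, 5: 30,
            4: 20, 3: 9, 2: 5, 1: 1, 0: 0}

coins = 10
throws = sum(observed.values())

# Expected frequency of k heads under the assumption of fair, independent
# coins: throws * C(10, k) / 2**10.
expected = {k: throws * comb(coins, k) / 2 ** coins for k in range(coins + 1)}

for k in range(coins, -1, -1):
    print(f"{k:2d} heads: observed {observed[k]:3d}, expected {expected[k]:6.2f}")
```

A single experiment of 100 throws will not reproduce the expected figures exactly, but the same massing of cases about five heads appears in both columns.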



Distribution of Economic Data

We find in these four widely different fields something
approaching a uniform law of arrangement of quantitative
data. The examples which have been given, however, do
not represent the world of economic facts. Do economic
data show the same general characteristics? If reference
be made to the examples given in Chapter III, comparisons
with the four preceding illustrations may be made. The
frequency distributions referred to are those relating to
weekly earnings of employees, the length of life of telephone
poles, the distribution of labor cost in sawmills and
the distribution of incomes below $4,000 in the United
States. (The curve of the latter distribution, it should be
noted, would show a long tail extending far to the right if
the incomes above $4,000 were included.) Several additional
examples of economic data may be given.

Figure 42 illustrates the order in which price variations are
distributed. It is based upon a study made by W. C. Mitchell
of 5,578 individual cases of change in the wholesale prices
of commodities from one year to the next.¹ Thus, for example,
the average price of middling upland cotton in
New York in a given year was $0.115 per pound. In the
following year the average price was $0.128 per pound, an
increase of 11.3 per cent. This would constitute one entry
in the table of rising prices, falling in the class "10-11.9%."
The entire table consists of 5,578 such entries. These data
are presented in Fig. 42 in the form of a frequency polygon,
no attempt being made to smooth the curve.

Fig. 42. Frequency Polygon: Distribution of 5,640 Cases of Change
in the Wholesale Prices of Commodities from One Year to the Next
(after Mitchell)
(Horizontal axis: percentage of fall, percentage of rise.)

¹ From Bulletin 284, U. S. Bureau of Labor Statistics, Part I, "The Making
and Using of Index Numbers," 18. The figure shows the price changes only
within the range of a 51 per cent fall and a 51 per cent rise. One case of a
price fall of 55 per cent is not shown, and 37 cases of price increases ranging
from 52 per cent to 104 per cent have not been included.
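
The cotton example may be worked through as a short Python sketch. It is offered as an illustration only: the function names are hypothetical, and the class width of two percentage points is an assumption inferred from the single class label quoted in the text.

```python
def percent_change(earlier, later):
    """Percentage change from the earlier price to the later price."""
    return (later - earlier) / earlier * 100.0


def rise_class(pct, width=2.0):
    """Label of the class containing a rise of `pct` per cent, assuming
    classes of equal width that begin at zero (an assumption)."""
    lower = int(pct // width * width)
    return f"{lower}-{lower + width - 0.1:.1f}%"


# Middling upland cotton, New York: $0.115 in one year, $0.128 in the next.
change = percent_change(0.115, 0.128)
print(round(change, 1), rise_class(change))
```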
Table 20 shows the distribution of London-New York
exchange rates (sterling exchange) from 1882 to 1913, inclusive.
This was a period when both currencies were freely
convertible into gold, at fixed ratios, with customary market
forces operating to keep exchange rates between the two
"gold points." Observations covering recent decades would
show quite different characteristics. In the distribution
shown graphically in Fig. 43 monthly rates have been
classified according to the frequency of their occurrence over
thirty-two years of pre-war experience.¹

Fig. 43. Frequency Polygon: Distribution of London-New York
Exchange Rates (as recorded over a period of 384 months)

¹ "The figures are . . . the averages of those quoted at the beginning of each
month in the Economist; on and after July, 18^, the exchange is the telegraphic
transfer; before that date, 'short at interest.'" The data are taken from
An Academic Study of Some Money Market and Other Statistics, by E. G.
Peake. London, P. S. King, 1923. Appendix I.

Table 20

Distribution of London-New York Exchange Rates as Recorded by
Months during the Period 1882-1913

Class-interval          Frequency
                        (number of months given rate prevailed)

$4.8275-$4.8324                 1
 4.8325- 4.8374                 6
 4.8375- 4.8424                11
 4.8425- 4.8474                21
 4.8475- 4.8524                23
 4.8525- 4.8574                24
 4.8575- 4.8624                25
 4.8625- 4.8674                40
 4.8675- 4.8724                46
 4.8725- 4.8774                49
 4.8775- 4.8824                35
 4.8825- 4.8874                46
 4.8875- 4.8924                33
 4.8925- 4.8974                16
 4.8975- 4.9024                 8
 4.9025- 4.9074                 1
 4.9075- 4.9124                 1
                              ---
 Total                        384

A fairly typical distribution of wage-earners classified
according to the amount of their weekly earnings is shown
in Table 21 and, graphically, in Fig. 44. The data relate to
13,427 steel workers in open-hearth furnaces, in the United
States, in 1935. There is a clear concentration of workers
whose earnings fall between $16 and $24 a week. The distribution
is markedly skewed, however, with a tail extending
far to the right. The range of weekly earnings, like that of
incomes in general, is far greater above the mode than
below.

Fig. 44. Distribution of Wage-Earners in Open-Hearth Furnaces,
Classified according to Average Weekly Earnings in 1935


Table 21

Distribution of Wage-Earners in Open-Hearth Furnaces in the United
States, Classified According to Average Weekly Earnings in 1935
(Total for all districts)

Class-interval              Frequency
(in dollars per week)       (number of workers earning stated amount)

$  0-$  7.99                      583
   8-  15.99                    2,200
  16-  23.99                    4,462
  24-  31.99                    3,032
  32-  39.99                    1,527
  40-  47.99                      764
  48-  55.99                      358
  56-  63.99                      210
  64-  71.99                      144
  72-  79.99                       44
  80-  87.99                       36
  88-  95.99                       21
  96- 103.99                       26
 104- 111.99                        3
 112- 119.99                        7
 120- 127.99                        1
 128- 135.99                        9
                               ------
 Total                         13,427


The frequency curves and histograms based upon economic
data, it will be noted, do not all show the symmetry
and regularity which seem to characterize the curves representing
physical data. Some are non-symmetrical, showing
a preponderance of cases on one side of the point of greatest
concentration. In some there are breaks in the regularity of
the increase or decrease of frequencies. But in spite of
these differences there is obviously a family resemblance
between the measurements drawn from the fields of economics,
astronomy, anthropometry, ballistics, and pure
chance. Certain of the common characteristics may be noted.

General Characteristics of Frequency Distributions

There is, in the first place, variation in the values of the 
measurements secured. Human heights vary, astronomical 
measurements of the same quantity differ, projectiles fired 
under conditions as nearly constant as it is humanly possible 
to make them fail to land at the same spot, incomes vary 
as between individuals, and exchange rates move from week 
to week and month to month. The various observations or 
values secured in a given case are distributed along a scale, 
between two extreme values. 

The distribution of these values along the scale (the
x-axis) is such that, moving from one extreme value towards
the other, the cases found at successive points along
the scale (the successive class frequencies) increase with
more or less regularity up to a maximum, and then decrease
in much the same way. In spite of variation, therefore,
we find a central tendency, a massing of cases at certain
points on the scale of values. This is the second notable
characteristic which all the frequency distributions appear
to possess in common.

If we measure, for each of the successive classes, the 
amount of deviation along the scale from the point of 
greatest concentration it will be noted that small deviations 
are much more frequent than large ones, that extreme 
deviations are rare, and that deviations on both sides of 
the point of concentration reach perfect (or almost perfect) 
equality in the examples taken from the physical sciences 
and from the field of pure chance, and approximate equality 
in the economic distributions. (Exceptions to this rule 
of approximate equality on the two sides of the point of 
greatest concentration are not infrequent, the example of 
income distribution being a rather striking case in point.) 
Figure 45 depicts a curve which is termed the "probability
curve," or the "normal curve of error." Its characteristics
will be discussed in greater detail in a later section.
At this point it is presented merely as a basic type which
some of the above examples approach closely, and from
which others of the examples represent more or less pronounced
deviations. Departures from this type, let it be
emphasized, are numerous and significant, but as a basic
form this normal curve of error is extremely important in
statistical work. Even the most important variations from
this type resemble it with sufficient closeness to justify the
use of a generalized method of describing frequency distributions.
Distributions of quantitative data vary, and their
variations from each other and from certain standard types
are of the greatest significance, but in spite of their variations
a family resemblance runs through them all. Each
new frequency distribution is not an isolated phenomenon,
but a member of a large family, and as such the problem
of describing and analyzing it may be approached with
confidence in methods which have been found applicable in
other cases.

Given this more or less common type, how may a given dis- 
tribution be described and differentiated from others? Certain 
methods will have been suggested by the preceding discussion. 

Methods of Describing a Frequency Distribution 

The values of all the observations, it has been noted, are
spread along a scale. The frequency distribution may be
described by the selection of a single value on that scale
which is thoroughly representative of the distribution as a
whole. Since the frequencies vary, an obvious choice is
the selection of that value which occurs the greatest number
of times, or, in other words, that point on the scale at
which the concentration is greatest. This value constitutes
a measure of the central tendency of the distribution.
Thus, one might find the income class in which the greatest
number of people fall, and let the mid-point of that class
(which is $950 in the distribution presented in Table 11)
serve as the representative of the distribution. This most
common value, it should be noted, is only one of several
possible measures of the central tendency of a given distribution.
All such measures are termed averages.

A single representative value such as this has many uses 
but, by itself, it obviously leaves out many facts concern- 
ing the distribution. Of great importance is the character 
of the distribution about the average. Are the values of all 
tabulated cases closely concentrated, or is there pronounced 
dispersion over a wide range? The representative character 
of any average depends upon how closely the other values 
cling to it, upon the degree of concentration about the 
central tendency. The average, therefore, must be supplemented
by a measure of variation, a measure of the "scatter"
about the central value.
An adequate description should include also an account
of the degree of symmetry of the distribution. It is highly 
important to know whether there is an equal distribution 
of cases on each side of the point of greatest concentration, 
or whether the frequency curve is skewed to one side, as in 
the case of income distribution illustrated above. If the 
curve is not symmetrical the degree of asymmetry should 
be determined, and for this purpose measures of skewness 
have been developed. 

It is, finally, possible to measure the degree of peaked- 
ness of frequency curves, by comparing them with the 
normal curve of error as a standard. It is obvious that 
the frequency polygon representing price changes (Fig. 42)
would, if smoothed, constitute a curve much more peaked 
than the normal curve, and this fact of pronounced con- 
centration at the central value is highly significant. This 
characteristic of frequency curves is called kurtosis, and 
the measurement of kurtosis constitutes the final step in 
the description of the frequency distribution. 

When these various measures have been secured the task
of statistical analysis will be well under way. The chaotic
assortment of data with which we started will have been
reduced to workable form in the shape of a frequency table,
and the essential facts which the table reveals will have
been distilled into three or four significant measures. This
process not only reveals the characteristics of the given
distribution, but also facilitates comparison with similar
distributions. For example, it is impossible to compare
some tens of millions of unorganized personal income figures
for the United States with similar data for Great Britain.
But if we secure a value for the average or most representative
income for each country, together with a description
of the distribution of personal incomes about that
central value, a legitimate basis for comparative study is
obtained. In manipulating and analyzing masses of material,
whatever the purpose of study may be, full use
should be made of the power to condense, simplify and
compare which is given by the measures employed in describing
the frequency distribution.

The succeeding section is devoted to a discussion of one 
phase of this descriptive process, that concerned with the 
measurement of central tendencies. After the development 
of this subject of averages, problems relating to measures 
of variation and of skewness will be dealt with. 

Averages

We have seen that the representation of a frequency 
distribution by an average, a single typical figure, is justi- 
fied because of the tendency of large masses of figures to 
cluster about a central value, from which the values of all 
observed cases depart with more or less regularity and 
smoothness. It is solely because of the concentration of 
cases about a central point on the scale that such repre- 
sentative figures have significance. The average represents 
the distribution as a whole only because it is a typical 
value. If the individual items entering into a distribution 
vary widely in value and show no tendency toward con- 
centration, no single value can represent them. Thus the 
arithmetic mean of the three numbers 3, 125, 1,000 is 376, 
but 376 in no way represents the three values on which it 
is based. This fundamental requirement, that there be a 
tendency toward concentration about a central value, must 
be met if an average is to be at all representative.
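
The point may be verified with a few lines of Python, given here purely as an illustration with hypothetical variable names: the mean of the three numbers is computed, together with the distance of each number from it.

```python
values = [3, 125, 1000]

mean = sum(values) / len(values)

# Distance of each observation from the mean; none lies anywhere near it.
gaps = [abs(v - mean) for v in values]

print(mean, gaps)
```

Even the nearest of the three values lies 251 units from the mean, so the figure 376 is typical of none of them.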

If the general character of a frequency distribution be
recalled the logic of one sort of average will be clear at
once. It was suggested above that that point on the x-scale
at which the concentration is greatest, that value which
occurs the greatest number of times, might be taken as
typical of the entire distribution. This value is termed
the mode, and the group in which it falls is called the modal
group. If a frequency curve be drawn to represent a given
distribution, the mode will be the x-value corresponding to
the maximum ordinate.¹ The maximum ordinate itself measures
the frequency of the modal group. Students frequently
confuse these two values in determining the mode. It is
not the distance along the y-scale but the distance along
the x-scale which measures the value of the mode. The
ordinates merely measure the number of cases falling in the
several classes, not the values of the cases falling in those
classes.
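
The distinction between the two values may be made concrete with the sawmill data of Table 15. In the Python sketch below, which is an added illustration, the mid-point of the modal class is taken as a rough value of the mode; this is a crude approximation, not the maximum of a fitted curve, and the variable names are hypothetical.

```python
# Table 15: lower class limits (dollars per 1,000 feet) and frequencies.
lower_limits = [1.00 + 0.50 * i for i in range(16)]
frequencies = [3, 10, 14, 22, 38, 40, 38, 33, 20, 11, 10, 11, 8, 4, 4, 3]
class_width = 0.50

# Distance along the y-scale: the frequency of the modal group.
max_ordinate = max(frequencies)

# Distance along the x-scale: the value of the mode, approximated here
# by the mid-point of the modal class (an assumption).
modal_index = frequencies.index(max_ordinate)
mode = lower_limits[modal_index] + class_width / 2

print(max_ordinate, mode)
```

The frequency of the modal group is 40 establishments; the mode itself is a labor cost of about $3.75 per 1,000 feet.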

As typical of a given distribution we might also select
that point on the scale of x-values on each side of which
one half the total number of cases fall. This value, which
is called the median, is that which exceeds the values of
one half the cases included, and is in turn exceeded by the
values of one half the cases. Thus it has been estimated
that in 1918 the median value of personal incomes in the
United States was $1,140; one half of the 37 million recipients
of personal incomes received less than this sum, while
one half received more. When a distribution is represented
by a frequency curve, the area under the curve is divided
into two equal parts by an ordinate erected at that point
on the x-axis corresponding to the median value. This
follows, of course, from the definition of the median, and
from the fact that the area under a frequency curve represents
the total number of cases included in the distribution.
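
One way of locating such a point in a grouped distribution is to interpolate within the class in which the middle case falls. The Python sketch below applies this to Table 15. It is an illustration resting on two assumptions not stated in the text: that the cases are evenly spread within each class, and that the class boundaries fall at $1.00, $1.50, and so on. The function name is hypothetical.

```python
def grouped_median(lower_limits, frequencies, width):
    """Value below which one half of the cases fall, found by
    straight-line interpolation within the class containing it."""
    half = sum(frequencies) / 2
    below = 0
    for lower, f in zip(lower_limits, frequencies):
        if below + f >= half:
            return lower + (half - below) / f * width
        below += f
    raise ValueError("empty distribution")


# Table 15: lower class limits (dollars per 1,000 feet) and frequencies.
lower_limits = [1.00 + 0.50 * i for i in range(16)]
frequencies = [3, 10, 14, 22, 38, 40, 38, 33, 20, 11, 10, 11, 8, 4, 4, 3]

median = grouped_median(lower_limits, frequencies, 0.50)
print(round(median, 2))
```

The same value could be read from the ogive as the abscissa of the point whose ordinate is one half of the total frequency.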

¹ Strictly speaking, the mode is the x-value corresponding to the maximum
ordinate of the ideal frequency curve which has been fitted to the given
distribution.

The arithmetic mean is a third type of average which
may be used to represent a distribution. This is a calculated
average, affected by the value of every item in the
distribution. Herein, obviously, it differs from the mode
and the median, which depend primarily upon the relative
position of the items in the frequency table, and are not
affected by the values of all individual items. The arithmetic
mean is the center of gravity of a distribution; it
would be the x-value of the point of balance of a frequency
curve, if the curve could be blocked out and manipulated
in solid form.

The geometric mean and the harmonic mean are two other 
averages the characteristics of which will be discussed at a 
later point. 

The computation or location of these various averages 
may involve somewhat lengthy processes if the number of 
cases included be great. If appropriate methods be em- 
ployed, however, the labor of computation may be materi- 
ally cut down. The use of the following symbols will sim- 
plify the explanation of these methods: 


M:          Arithmetic mean.

Mo:         Mode.

Md:         Median.

m:          The value of an individual observation; in a frequency
            distribution, the value of the midpoint of a class.

f:          The number of items (observations) in a given class
            in a frequency distribution.

N:          The total number of items in a given series or
            frequency distribution.

Σ (Sigma):  The symbol for the process of summation, meaning
            "the sum of."


The Computation of the Arithmetic Mean 

Using the above notation, the formula for the arithmetic
mean is

$$M = \frac{\Sigma m}{N}$$

Thus the mean of the measures 2, 5, 6, 7, is equal to the
sum of these measures divided by 4, which is 20/4 or 5. The
computation of the arithmetic mean when each measure is
reported at its true value is thus a simple process of summation
and division. The weekly earnings of 210 factory
employees were listed in an earlier section. If these figures
be added, and the total divided by 210, the mean weekly
wage is found to be $26.983. In this case the task of adding
210 items is somewhat tedious; it is a task which would
become almost impossible if one were dealing with the
37 million personal income figures, for example. For practical
reasons, therefore, it is usually necessary to compute
the required averages from the frequency distribution rather
than from the original ungrouped data. To exemplify this
process we may utilize data relating to the weekly earnings
of steel workers in the Pittsburgh District in 1935.

The importance of certain of the precautions mentioned in 
the section on classification, in connection with the choice 
of a class-interval, will be clear from this example. When 
the mean of a distribution is calculated from classified
observations, we must assume an even distribution of cases
within each class. The class-interval should be selected
with this in mind, in order that errors introduced by the
assumption may be minimized. If the items in each class
are evenly distributed, the mid-value of each class may be 
taken as representative of all the observations included; 
when such a mid-value is multiplied by the number of items 
in the class, the product is approximately equal to the sum 
of all the individual items in the class. The formula for the
mean thus becomes

\[ M = \frac{\Sigma (fm)}{N}. \]

Table 22 illustrates the procedure in detail.

Table 22 ¹

Calculation of the Arithmetic Mean of Weekly Earnings of Workers
in Open-Hearth Furnaces in the Pittsburgh District in 1935

Class-interval        Mid-point   Frequency
(dollars per week)        m           f            fm

$  0-$  3.99              2          67           134
   4-   7.99              6         290         1,740
   8-  11.99             10         437         4,370
  12-  15.99             14         730        10,220
  16-  19.99             18       1,056        19,008
  20-  23.99             22       1,009        22,198
  24-  27.99             26         712        18,512
  28-  31.99             30         609        18,270
  32-  35.99             34         334        11,356
  36-  39.99             38         187         7,106
  40-  43.99             42         179         7,518
  44-  47.99             46         105         4,830
  48-  51.99             50          60         3,000
  52-  55.99             54          67         3,618
  56-  59.99             58          28         1,624
  60-  63.99             62          37         2,294
  64-  67.99             66          33         2,178
  68-  71.99             70          29         2,030
  72-  75.99             74          16         1,184
  76-  79.99             78           8           624
  80-  83.99             82           3           246
  84-  87.99             86           8           688
  88-  91.99             90           4           360
  92-  95.99             94           7           658
  96-  99.99             98           9           882
 100- 103.99            102           5           510
 104- 107.99            106           1           106
 108- 111.99            110           1           110

Total                             6,031       145,374

\[ M = \frac{\Sigma (fm)}{N} = \frac{\$145{,}374}{6{,}031} = \$24.1045. \]

¹ These figures and similar data appearing in subsequent tables were
compiled by Edward K. Frazier, of the Division of Wages, Hours and Working
Conditions, U. S. Bureau of Labor Statistics. See “Earnings and Hours in
Blast Furnaces, Bessemer Converters, Open-Hearth Furnaces and Electric
Furnaces, 1933 and 1935,” Monthly Labor Review, April, 1936. The detailed
statistics in Table 22 were provided through the courtesy of Dr. Isador Lubin,
Commissioner of Labor Statistics.

The value secured in this way is sometimes called a
weighted arithmetic mean. What we do, in effect, is to
secure the arithmetic mean of the 28 figures in the column
headed m. We do not take a simple average of these figures,
however, but weight each one in proportion to the
number of cases falling in the class-interval of which it is
the mid-value. It is precisely the procedure we should follow
in calculating the mean of five men’s incomes, two of
whom, let us say, have incomes of $2,000 and three of whom
have incomes of $3,000. Clearly it would not do to add the
figures $2,000 and $3,000, dividing the sum by two. The
figure $2,000 is given a weight of two, the figure $3,000 is
given a weight of three, and the resultant sum, $13,000, is
divided by five. Though the procedure in working from the
frequency distribution is thus a form of weighting, the term
“weighted average” is coming to have a more restricted
meaning, to be explained at a later point, and should not
in general be applied to an average computed from a
frequency distribution.
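The computation from the frequency table can be checked mechanically. The following is a minimal sketch in Python, offered as an illustration only: the function name `grouped_mean` is an assumption of this sketch, and the mid-points and frequencies are those read from Table 22.

```python
# Mid-points and frequencies as read from Table 22 (28 classes of width $4).
midpoints = list(range(2, 111, 4))          # 2, 6, 10, ..., 110
frequencies = [67, 290, 437, 730, 1056, 1009, 712, 609, 334, 187,
               179, 105, 60, 67, 28, 37, 33, 29, 16, 8,
               3, 8, 4, 7, 9, 5, 1, 1]


def grouped_mean(midpoints, frequencies):
    """Return sum(f * m) / sum(f) for a frequency distribution."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have equal length")
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the distribution contains no items")
    return sum(f * m for m, f in zip(midpoints, frequencies)) / n


total_fm = sum(f * m for m, f in zip(midpoints, frequencies))
print(sum(frequencies), total_fm)                       # N and the sum of fm
print(round(grouped_mean(midpoints, frequencies), 4))   # mean weekly earnings
print(grouped_mean([2000, 3000], [2, 3]))               # the five incomes
```

Run as written, the sketch should report N = 6,031, a sum of 145,374, and a mean of 24.1045; the five incomes give 2,600, that is, $13,000 divided by five.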

SHORT METHOD OF COMPUTING THE ARITHMETIC MEAN

The calculation of the arithmetic mean from the frequency
table is much easier, in general, than from the ungrouped
data, but when the number of cases included is
large even the computation from the frequency table by
the method illustrated above may be laborious. The
procedure may be greatly simplified.

From the method of computing the arithmetic mean it 
follows that the algebraic sum of the deviations of a series 
of individual magnitudes from their mean is zero. This 
may be readily demonstrated. We may represent the series
of magnitudes by \(m_1, m_2, m_3, \ldots, m_n\), their arithmetic
mean by M, and the deviations of the various magnitudes
from the mean by \(d_1, d_2, d_3, \ldots, d_n\).

Then

\[ \frac{m_1 + m_2 + m_3 + \ldots + m_n}{N} = M \tag{1} \]

and

\[ m_1 + m_2 + m_3 + \ldots + m_n = NM. \tag{2} \]

The number of terms, of course, is equal to N. Therefore,
subtracting M N times from each side of the equation,

\[ (m_1 - M) + (m_2 - M) + (m_3 - M) + \ldots + (m_n - M) = 0. \tag{3} \]

But \(m_1 - M = d_1\), \(m_2 - M = d_2\), etc., and equation (3)
may be written

\[ \Sigma d = 0. \]

Knowing this to be true we may measure the deviations
of a series of magnitudes from any arbitrary quantity,
secure the algebraic sum of the deviations, and from this
value ascertain the difference between the arbitrary quantity
and the true mean. For this difference will be the
mean of the deviations from the arbitrary origin. If we
let M' represent the arbitrary origin, or assumed mean,
while c = M - M', and \(d_1', d_2', d_3', \ldots, d_n'\) represent the
deviations of the various magnitudes from M' (i.e.,
\(d_1' = m_1 - M'\), \(d_2' = m_2 - M'\), etc.), then

\[ d_1' = d_1 + c, \quad d_2' = d_2 + c, \quad d_3' = d_3 + c, \quad \ldots, \quad d_n' = d_n + c \]

and

\[ \Sigma d' = \Sigma d + Nc. \]

But

\[ \Sigma d = 0, \qquad \Sigma d' = Nc \]

and

\[ c = \frac{\Sigma d'}{N}. \]

From the known values of M' and c the value of the true
mean may be obtained, for M = M' + c. The procedure is
illustrated in the following simple example:

Table 23

Computation of the Arithmetic Mean (Short Method)
(Ungrouped data)

    m       f       d'

    5       1      -15
   15       1       -5
   25       1       +5
   35       1      +15
   45       1      +25

            5      +25

M' = 20

c = Σd'/N = +25/5 = 5

M = M' + c = 20 + 5 = 25


When the deviations are measured from 20 as arbitrary 
origin there is in each case a constant error, if the devia- 
tion from the true mean be taken as standard. This error 
is equal to the difference between the true and the assumed 
means. The algebraic sum of the deviations from the 
assumed mean will equal N times this constant error, since
the error is repeated once for every item included. By
dividing the sum of these deviations by N the amount of 
the error may be determined and the value of the mean 
thus obtained. 
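The two properties used here, that deviations from the true mean sum to zero and that deviations from an assumed mean sum to N times the constant error, may be verified on the figures of Table 23. The short Python sketch below is illustrative only; the variable names are choices made for this sketch.

```python
values = [5, 15, 25, 35, 45]      # the five magnitudes of Table 23
assumed_mean = 20                 # M', the arbitrary origin

n = len(values)
true_mean = sum(values) / n

# Deviations from the true mean: their algebraic sum is zero.
d = [v - true_mean for v in values]

# Deviations from the assumed mean: each carries the constant error c.
d_prime = [v - assumed_mean for v in values]

c = sum(d_prime) / n              # c = (sum of d') / N
print(sum(d), sum(d_prime), c, assumed_mean + c)
```

The printed values should be 0.0, 25, 5.0 and 25.0, agreeing with the table: the sum of d' is +25, c is 5, and M = 20 + 5 = 25.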


Table 24

Calculation of the Arithmetic Mean of Weekly Earnings of Workers
in Open-Hearth Furnaces in the Pittsburgh District in 1935
(Short method)

M' = $30

Class-interval       Mid-point  Frequency   d' (in class-         fd'
(in dollars              m          f       interval units)    -        +
per week)

$  0-$  3.99             2         67           - 7           469
   4-   7.99             6        290           - 6         1,740
   8-  11.99            10        437           - 5         2,185
  12-  15.99            14        730           - 4         2,920
  16-  19.99            18      1,056           - 3         3,168
  20-  23.99            22      1,009           - 2         2,018
  24-  27.99            26        712           - 1           712
  28-  31.99            30        609             0
  32-  35.99            34        334           + 1                    334
  36-  39.99            38        187           + 2                    374
  40-  43.99            42        179           + 3                    537
  44-  47.99            46        105           + 4                    420
  48-  51.99            50         60           + 5                    300
  52-  55.99            54         67           + 6                    402
  56-  59.99            58         28           + 7                    196
  60-  63.99            62         37           + 8                    296
  64-  67.99            66         33           + 9                    297
  68-  71.99            70         29           +10                    290
  72-  75.99            74         16           +11                    176
  76-  79.99            78          8           +12                     96
  80-  83.99            82          3           +13                     39
  84-  87.99            86          8           +14                    112
  88-  91.99            90          4           +15                     60
  92-  95.99            94          7           +16                    112
  96-  99.99            98          9           +17                    153
 100- 103.99           102          5           +18                     90
 104- 107.99           106          1           +19                     19
 108- 111.99           110          1           +20                     20

Total                           6,031                    - 13,212  + 4,323

Calculations

1. Algebraic sum of deviations from M':
   - 13,212 + 4,323 = - 8,889

2. Calculation of c (in class-interval units):
   c = - 8,889 / 6,031 = - 1.47388

3. Reduction of c to original units:
   Class-interval = $4
   c (in original units) = - 1.47388 × $4 = - $5.8955

4. Determination of M:
   M = M' + c
   M = $30 - $5.8955
   M = $24.1045


The work of computation may be still further abbreviated,
for observations arranged in the form of a frequency
distribution, by measuring the deviations in terms of the
class-interval as a unit. Then, in finally applying the necessary
correction, the difference between the true and assumed
means may be again expressed in terms of the original units.
The method may be illustrated in detail with reference to
the wage data for which the mean has already been calculated.

The steps in this process of calculating the arithmetic
mean by the short method may be briefly summarized:

1. Organize the data in the form of a frequency distribution.

2. Adopt as the assumed mean the midpoint of a class near the
   center of the distribution.

3. Arrange a column showing the deviation (d') from the assumed
   mean of the items in each class, in terms of class-interval
   units. This deviation will be zero for the items in the class
   containing the assumed mean, -1 for the items in the next
   lower class, +1 for the items in the next higher class, and so
   on.

4. Multiply the deviation of each class by the frequency of that
   class, taking account of signs. These products are entered
   in the column fd'.

5. Get the algebraic sum of the items entered in the column fd'.

6. Divide this sum by the total frequency (N). The quotient is
   the correction (c) in class-interval units.

7. Multiply the correction (c) by an amount equal to the
   class-interval. The product is the correction in terms of the
   original units.

8. Add this correction (algebraically) to the assumed mean (M');
   the sum is the true mean (M).
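The eight steps may be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed routine: the function name `short_method_mean` is invented for the sketch, the assumed mean is taken to be the mid-point of one of the classes, and the data are the mid-points and frequencies of the open-hearth wage distribution.

```python
def short_method_mean(midpoints, frequencies, assumed_mean, class_interval):
    """Mean of a frequency distribution via deviations from an assumed mean.

    Returns (algebraic sum of f*d', correction in original units, mean).
    The assumed mean is expected to be the mid-point of one of the classes.
    """
    n = sum(frequencies)
    # Step 3: deviation of each class from the assumed mean, in class units.
    steps = [round((m - assumed_mean) / class_interval) for m in midpoints]
    # Steps 4 and 5: algebraic sum of frequency times deviation.
    sum_fd = sum(f * d for f, d in zip(frequencies, steps))
    # Steps 6 and 7: correction in class units, then in original units.
    correction = sum_fd / n * class_interval
    # Step 8: add the correction (algebraically) to the assumed mean.
    return sum_fd, correction, assumed_mean + correction


midpoints = list(range(2, 111, 4))          # 2, 6, 10, ..., 110
frequencies = [67, 290, 437, 730, 1056, 1009, 712, 609, 334, 187,
               179, 105, 60, 67, 28, 37, 33, 29, 16, 8,
               3, 8, 4, 7, 9, 5, 1, 1]

sum_fd, correction, mean = short_method_mean(midpoints, frequencies, 30, 4)
print(sum_fd, round(correction, 4), round(mean, 4))
```

With an assumed mean of $30 and a class-interval of $4 the sketch should give an algebraic sum of -8,889, a correction of -5.8955, and a mean of 24.1045, the same value as that obtained by the longer method.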

Location of the Median

UNGROUPED DATA

The median is a value of a variable so selected that
50 per cent of the total number of cases, when arranged in
order of magnitude, lie below it and 50 per cent above it.
For many frequency distributions this is a useful and
significant value.

When handling data which are not arranged in the form
of a frequency distribution the location of the median is a
simple matter. The data having been arranged in order of
magnitude, it is necessary only to count from one end
until that point on the scale of values is found which divides
the number of cases into two equal parts. As a simple
example we may assume that the following seven figures
represent the annual incomes of seven individuals:

$750 $975 $1,128 $1,450 $1,475 $1,825 $1,950

The scale of values extends from $750 to $1,950, and
seven items are arranged along this scale. The value of
$1,000 has two items on one side and five items on the
other, so obviously does not conform to our definition of
the median. The value of $1,450, which corresponds with
the income of one of the seven individuals, is the median
in this case. Three items lie on each side of this value; or,
if we assume the central item to be cut in two, 3½ items lie
on each side of this point. This case is illustrated in Fig. 46.
This diagram may help to bring out the fact that the
median is a point on a scale so located that it cuts the
frequencies in two.

Fig. 46. Illustrating the Location of the Median with Ungrouped Data
(Personal incomes of seven individuals)

The problem is slightly different when an even number of
cases is included. This condition is exemplified in Table 25,
which shows the average earnings per man-hour
in each of 38 selected industries during the year 1933.

Table 25

Average Earnings per Man-Hour in Selected Manufacturing Industries ¹

Industry                                          Average wage
                                                  per man-hour

Silk and rayon goods: Commission throwing            $.278
Cotton goods                                          .279
Cigars                                                .299
Silk and rayon goods: Commission weaving              .313
Silk and rayon goods: Regular throwing                .316
Knit underwear                                        .319
Knit outerwear                                        .358
Cigarettes                                            .361
Silk and rayon goods: Regular weaving                 .369
Wool shoddy                                           .370
Hosiery                                               .372
Cotton small wares                                    .378
Woolen goods                                          .395
Sugar, beet                                           .395
Worsted goods                                         .399
Snuff, and chewing and smoking tobacco                .402
Knit cloth                                            .414
Rayon yarns                                           .421
Feeds, prepared                                       .425
Meat packing                                          .426
Pulp                                                  .431
Ice, manufactured                                     .436
Flour milling                                         .444
Paper                                                 .445
Carpets and rugs, wool                                .464
Leather tanning                                       .470
Sugar refining, cane                                  .481
Soap                                                  .482
Blast furnaces                                        .488
Felt goods                                            .488
Cereal preparations                                   .510
Steel works                                           .519
Motor vehicle bodies and parts                        .561
Machine tools                                         .585
Motor vehicles                                        .610
Machine-tool accessories                              .621
Petroleum refining                                    .643
Malt                                                  .667

¹ From Monthly Labor Review, October, 1935, 910.

In this case the median must be a value on each side of
which 19 industries lie. Therefore any value exceeding
$0.425 (average earnings in the prepared feed industry)
and less than $0.426 (average earnings in the meat packing
industry) will satisfy the definition of a median. Under
these conditions, where the median is really indeterminate,
a value half-way between the two limiting values is accepted,
by convention. The median of the 38 figures would thus
be $0.4255.

In this example the median value does not correspond 
with the earnings in any one industry. This will frequently 
be so when there is an even number of observations. 
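The rule for ungrouped data, taking the central item when the number of cases is odd and the value half-way between the two central items when it is even, may be sketched as follows. The function name `median_ungrouped` is an assumption of this sketch; the wage figures are entered in thousandths of a dollar so that the arithmetic is exact.

```python
def median_ungrouped(values):
    """Median of ungrouped data: the central item, or the value half-way
    between the two central items when the number of cases is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no observations")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Seven annual incomes (odd number of cases).
incomes = [750, 975, 1128, 1450, 1475, 1825, 1950]

# Average earnings per man-hour in 38 industries, in thousandths of a dollar.
wages = [278, 279, 299, 313, 316, 319, 358, 361, 369, 370,
         372, 378, 395, 395, 399, 402, 414, 421, 425, 426,
         431, 436, 444, 445, 464, 470, 481, 482, 488, 488,
         510, 519, 561, 585, 610, 621, 643, 667]

print(median_ungrouped(incomes))   # the central income
print(median_ungrouped(wages))     # half-way between 425 and 426
```

The first call should return 1450 and the second 425.5, that is, $0.4255.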

GROUPED DATA 

The task of locating the median is essentially the same 
when the data are in the form of a frequency distribution. 
The fact that the real values of the individual items are
not known, because of the grouping by classes, complicates 
the problem slightly. The data in Table 26, relating to 
advertising rates of daily newspapers in the United States, 
may be used in illustrating the method. 

Table 26

Location of Median, Newspaper Advertising Rates in 1933
Minimum Line Rates for National Advertising, 245 Daily Newspapers
in Cities of 25,000 to 50,000 Population *

Class-interval        No. of newspapers
Rate per line         charging stated rate
(in cents)                   f

 1.0- 2.99                   6
 3.0- 4.99                  53
 5.0- 6.99                  85
 7.0- 8.99                  56
 9.0-10.99                  21
11.0-12.99                  16
13.0-14.99                   4
15.0-16.99                   4

Total                      245

\[ \frac{N}{2} = \frac{245}{2} = 122.5 \]

\[ Md = 5.0 + \left(\frac{63.5}{85} \times 2.0\right) = 5.0 + 1.49 = 6.49 \]

* Source: Editor and Publisher, International Yearbook for 1933.

In the present case the location of the median involves
the determination of that value on each side of which 122.5
items lie. We may assume that we start at the lower end
of the scale and move through the successive classes. When
we reach the upper limit of the first class (that including
items having values from 1.0 to 3.0) we have left behind
us 6 cases, while 239 lie in front of us. When the upper
limit of the second class is attained, 59 items have been
passed. The upper limit of the third class has below it
144 items. Somewhere between the lower and upper limits
of the third class lies the desired point, that which has
122.5 items on each side of it. How far must we move
through this class, from 5.0 to 7.0, in order to reach this
point?

It will be recalled that, for purposes of calculation, the
assumption is made that there is a uniform distribution of
the items lying within any given class. Since before we
reach the third class 59 cases have been counted, only 63.5
of the 85 included in this class are needed to complete the
desired number, 122.5. On the assumption of even
distribution the required 63.5 cases will lie within a distance
on the scale equal to 63.5/85 of the class-interval. The
class-interval is 2.0; 63.5/85 of 2.0 is equal to 1.49. As we move
up the scale, then, having reached 5.0, we proceed an
additional distance equal to 1.49. At a point on the scale having
a value of 6.49 is the dividing line on each side of which
lie 122.5 cases. This is the value of the median.

The process of computation is shown below the
frequency table. The following is a summary of the steps
involved in the location of the median:

1. Arrange the data in the form of a frequency distribution.

2. Divide the total number of measures by 2; this gives the
   number which must lie on each side of the point to be located.

3. Begin at the lower end of the scale and add together the
   frequencies in the successive classes until the lower limit of the
   class containing the median value is reached.

4. Determine the number of measures from this class which must
   be added to the frequencies already totaled to give a number
   equal to N/2.

5. Divide the additional number thus required by the total
   number of cases in the class containing the median. This
   indicates the fractional part of the class-interval within which
   the required cases lie.

6. Multiply the class-interval by the fraction thus set up.

7. To the lower limit of the interval containing the median add
   the result of the multiplication process indicated in (6).
   This gives the value of the median.

The last three steps constitute merely a simple form of
interpolation.

The entire process may be reversed by beginning at the
upper end of the scale and counting downwards. In this
case the final operation is one of subtraction from the upper
limit of the interval containing the median.

N/2 may be a fractional value, as in the example given,
or a whole number. The operation is precisely the same in
the two cases.
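The seven steps amount to a short loop over the classes. The Python sketch below is one possible rendering, offered with the usual reservations: the name `grouped_median` is invented for the illustration, classes are assumed to be of equal width, and the frequencies are those of the newspaper advertising rates (6, 53, 85, 56, 21, 16, 4, 4, totalling 245).

```python
def grouped_median(lower_limits, frequencies, class_interval):
    """Locate the median of a frequency distribution by interpolation,
    assuming an even distribution of cases within the median class."""
    half = sum(frequencies) / 2                 # step 2: N/2
    passed = 0                                  # step 3: cases counted so far
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and passed + f >= half:
            needed = half - passed              # step 4
            fraction = needed / f               # step 5
            return lower + fraction * class_interval   # steps 6 and 7
        passed += f
    raise ValueError("the distribution contains no items")


# Rates per line, in cents, for 245 daily newspapers.
lower_limits = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
frequencies = [6, 53, 85, 56, 21, 16, 4, 4]

median = grouped_median(lower_limits, frequencies, 2.0)
print(round(median, 2))
```

The result, rounded to two places, should be 6.49: 59 cases are passed before the third class is reached, 63.5 more are needed, and 63.5/85 of the interval of 2.0 is 1.49.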


Quartiles and Deciles

For many purposes it is desirable to locate on the scale
of values, along which the items constituting a frequency
distribution are ranged, points dividing the total number
of measures in other ways. Similar to the median, which
divides the total number of cases into two equal groups,
are the quartiles, deciles, and percentiles. The quartiles,
as the term implies, are points on the scale which divide
the entire number of measures into four equal groups, the
deciles divide the number into ten equal groups, and the
percentiles divide the total number of cases into 100 equal
groups. Thus the first quartile is that point on the scale
below which one quarter of the total number of cases lie
and above which three quarters of the total number of
cases lie. The second quartile and the median are identical
values. The third decile is that point on the scale below
which three tenths of the total number of cases lie and
above which seven tenths of the total number of cases lie.
In all cases the count begins at the lower end of the scale.

Example: Location of the First Quartile (\(Q_1\)), Newspaper Advertising Rates
(See Table 26)

\[ \frac{N}{4} = 61.25, \qquad Q_1 = 5.0 + \left(\frac{2.25}{85} \times 2.0\right) = 5.05 \]

Example: Location of Eighth Decile (\(D_8\)), Newspaper Advertising Rates
(See Table 26)

\[ \frac{N}{10} = 24.5, \qquad \frac{8N}{10} = 196, \qquad D_8 = 7.0 + \left(\frac{52}{56} \times 2.0\right) = 8.86 \]

A method of locating median, quartiles, deciles and
percentiles graphically is explained below.
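The same interpolation serves for any of these dividing points; only the number of cases to be counted off changes. The following Python sketch generalizes the procedure. The function name `grouped_quantile` and its parameters `k` and `parts` are assumptions made for the illustration (k = 1, parts = 4 gives the first quartile; k = 8, parts = 10 the eighth decile).

```python
def grouped_quantile(lower_limits, frequencies, class_interval, k, parts):
    """Point on the scale below which k/parts of the cases lie, found by
    interpolation within the class that contains it."""
    target = sum(frequencies) * k / parts       # e.g. N/4 or 8N/10
    passed = 0                                  # cases below the current class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and passed + f >= target:
            return lower + (target - passed) / f * class_interval
        passed += f
    raise ValueError("the distribution contains no items")


# Newspaper advertising rates: lower class limits (cents) and frequencies.
lower_limits = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
frequencies = [6, 53, 85, 56, 21, 16, 4, 4]

q1 = grouped_quantile(lower_limits, frequencies, 2.0, 1, 4)
d8 = grouped_quantile(lower_limits, frequencies, 2.0, 8, 10)
md = grouped_quantile(lower_limits, frequencies, 2.0, 2, 4)
print(round(q1, 2), round(d8, 2), round(md, 2))
```

Rounded to two places, the three values should be 5.05, 8.86 and 6.49; the second quartile coincides with the median, as stated above.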

Location of the Mode 

The mode is the value of the variable corresponding to
the maximum ordinate of a given frequency curve. The
concept of a modal value is a thoroughly easy one to grasp.
It is the most common wage, the most common income,
the most common height. It is the point where the
concentration is greatest, a characteristic which is effectively
brought out by Fechner’s term for this average, dichtester
Wert, or thickest value. It is not so easy, however, to locate
the true modal value in a given case. In general statistical
work an approximate value only is secured for the mode,
but for most practical purposes this value is usually
sufficiently accurate.¹

¹ A method of locating the mode more accurately is explained in a later

The method of determining this approximate modal value
may be illustrated by reference to the distribution shown
in Table 27.

Table 27

Frequency Distribution of Five Per Cent Bonds

(This table is based upon quotations on the New York Stock Exchange on
June 13, 1936, on railroad and industrial bonds with coupon rate of 5 per cent)

Quoted price        Mid-point    Frequency
Class-interval          m            f

Less than 40                        22
 40- 49.9              45            5
 50- 59.9              55            5
 60- 69.9              65            3
 70- 79.9              75            8
 80- 89.9              85            9
 90- 99.9              95           19
100-109.9             105           49
110-119.9             115           10
120-129.9             125            3
130-139.9             135            1

There is wide dispersion of the 22 cases falling below 40,
and the existence of this “open-end” class makes it
impossible to compute the mean, as the table stands. The mode
is therefore an appropriate average to employ in the present
instance.

The class having limits of 100-109.9 contains the greatest
number of cases. This appears to be the modal group, and 
the mid-point of this class, 105, may be tentatively accepted 
as the value of the approximate mode. But with different 
classifications quite different values might be secured for 
the mode. When the original bond quotations are tabulated 
with varying class-intervals the following results are secured. 
(Only the frequencies of the central classes are shown. It 
is not necessary, for this purpose, to present each of the 
tables as a whole.) 


(a) Class-interval = 5

Class-interval          f
 80- 84.9               3
 85- 89.9               6
 90- 94.9              10
 95- 99.9               9
100-104.9              29
105-109.9              20
110-114.9               7
115-119.9               3

(b) Class-interval = 2.5

Class-interval          f
 90.0- 92.49            4
 92.5- 94.99            6
 95.0- 97.49            2
 97.5- 99.99            7
100.0-102.49            9
102.5-104.99           20
105.0-107.49           13
107.5-109.99            7

(c) Class-interval = 2.5

Class-interval          f
 98.75-101.249          6
101.25-103.749         17
103.75-106.249         20
106.25-108.749          8

(d) Class-interval = 1

Class-interval          f
100-100.9               1
101-101.9               2
102-102.9               9
103-103.9              10
104-104.9               7
105-105.9               0
106-106.9               6
107-107.9               4

With a class-interval of 5 a value of 102.5 is secured for
the mode; with a class-interval of 2.5 a value of 103.75 is
obtained. A class-interval of 2.5, again, but with different
class limits, yields a mode of 105. Finally, a class-interval
of 1 gives a mode of 103.5. Further changes in classification
would give still other values. The mode thus appears to be
a curiously intangible and shifting average. Its value, for
the same data, seems to vary with changes in the size of the
class-interval and in the location of the class-limits.

These difficulties arise primarily from limitations to the
size of the sample being studied. The true mode, that
value which would occur the greatest number of times in
an infinitely large sample, could be located exactly if we
could increase indefinitely the number of cases included.
For, given sufficient cases, the approximate mode approaches
the true mode as the class-interval decreases. Grouping in
large classes obscures details, and as these classes are
reduced in size more of the details are seen and a truer picture
of the actual distribution is secured. But since most
practical work is necessarily based upon relatively small samples,
the increase in the number of classes reveals gaps and
irregularities, and causes such a loss of symmetry and order
that doubt arises as to where the point of greatest
concentration really lies. The different tabulations of bond prices
furnish an excellent example of this.

By mathematical methods it is possible to obtain a value
for the true mode without securing an infinite number of
cases. The smoothing process has been briefly explained.
One sort of smoothing involves the fitting of an appropriate
type of ideal frequency curve to the data of a given
frequency distribution. This gives, theoretically, the
distribution which would be secured by the process first
indicated, that of decreasing indefinitely the size of the
class-interval and increasing indefinitely the number of cases.
The value of the x-variable corresponding to the maximum
ordinate of this ideal fitted curve is the true mode.

For most practical purposes approximate values of the mode
are adequate, and these may be secured by much simpler
methods. A first and rough approximation may be obtained
by taking the mid-value of the class of greatest frequency, a
method suggested above. If the general rules for classification
which were outlined in an earlier section have been
followed, this procedure will not generally involve a gross error.

It is possible, given a fairly regular distribution, to secure, 
by a process of interpolation within the modal group, a 
closer approximation than is obtained by accepting the mid- 
value of this group as the mode. Referring again to the 
tabulation of bond prices in Table 27 it will be noted that 
the distribution on the two sides of the modal class is not 
symmetrical. The modal class is that with a mid-value of 
105. The class next below, with a mid-value of 95, contains 
19 cases, while that next above, with a mid-value of 115, 
contains but 10 cases. The disproportion is continued in 
the succeeding classes below and above, more cases being 
bulked below the modal class than above. For other pur- 
poses we have assumed an even distribution of cases between 
the upper and lower limits of each class, but it is probable 
that this is not true of the modal class in the present case. 
Judging from the distribution outside this class, it is likely 
that the concentration is greater in the lower half of the 
class-interval, that is, between 100 and 105. The mode, 
therefore, probably lies below the mid-value 105, rather 
than precisely at that point. We may attempt to locate it 
within the group by weighting, assuming a pull toward the 
lower end of the scale equal to 19 (the number in the class 
next below) and a pull toward the upper end of the scale 
equal to 10 (the number in the class next above). This may 
be expressed by a formula, employing the following symbols:

l = lower limit of modal class.
f_1 = frequency of class next below modal class in value.
f_2 = frequency of class next above modal class in value.
i = class-interval.

The interpolation formula is

\[ Mo = l + \frac{f_2}{f_2 + f_1} \times i. \]

Applying this formula to the bond price data presented in
Table 27, we have

\[ Mo = 100 + \left(\frac{10}{10 + 19} \times 10\right) = 100 + 3.45 = 103.45. \]

A closer approximation may sometimes be secured by
basing the weights (represented by f_2 and f_1) upon the total
frequencies of the two or three classes next above the
modal class and the same number below. If three classes
on each side are included in the present case, a value of
102.8 is secured for the mode of bond prices.
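The weighting within the modal class reduces to a single line of arithmetic. In the Python sketch below the function name `interpolated_mode` and its argument names are illustrative assumptions; the frequencies are those of the bond price distribution, whose modal class runs from 100 to 109.9 with 19 cases in the class next below and 10 in the class next above.

```python
def interpolated_mode(lower_limit, freq_below, freq_above, class_interval):
    """Approximate mode: the lower limit of the modal class plus the share
    of the class-interval given by the weight of the classes above,
    relative to the combined weight of the classes above and below."""
    total = freq_above + freq_below
    if total == 0:
        raise ValueError("neighbouring classes contain no cases")
    return lower_limit + freq_above / total * class_interval


# One class on each side of the modal class 100-109.9.
one_class = interpolated_mode(100, freq_below=19, freq_above=10,
                              class_interval=10)

# Three classes on each side: 8 + 9 + 19 below, 10 + 3 + 1 above.
three_classes = interpolated_mode(100, freq_below=8 + 9 + 19,
                                  freq_above=10 + 3 + 1, class_interval=10)

print(round(one_class, 2), round(three_classes, 1))
```

The two results should be 103.45 and 102.8. With equal weights on the two sides the formula returns the mid-value of the modal class, which is the rough approximation described earlier.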

In some cases the problem of locating the mode is com- 
plicated by the existence of several points of concentration, 
rather than the single point which has been assumed in 
the preceding explanation. Thus in Table 9, representing 
the distribution of wages, with a class-interval of 25 cents, 
there are two definite modal points. A distribution of this 
type is called bi-modal; when plotted, a frequency curve 
having two humps is obtained. If the data are homogene- 
ous such a distribution is the result of paucity of data and 
of the method of classification employed. It may be due 
to the use of a class-interval too small, with respect to the
number of cases included in the sample. An approximate 
mode may be determined in such cases by shifting the 
class-limits and increasing the class-interval, carrying on 
this process until one modal group is definitely established. 
This reverses the process by which the true mode may be 
located when the number of cases is infinitely large. Under 
such conditions the class-interval might be reduced until 
it was infinitely small. But with a limited number of cases 
the location of the point where the concentration is greatest 
necessitates increasing the size of the class-interval, in order
to get away from the irregularities due to the smallness of
the sample.

If the distribution remains bi-modal in spite of changes 
in the class-intervals and class-limits, it is probable that 
the data are not homogeneous, that two different distri- 
butions have by mistake been combined. Such cases are 
not uncommon in biometrical work. The existence of two 
distinct animal species where only one was suspected has 
been revealed in this way. The whole significance of a 
frequency distribution will be lost if the data are not homo- 
geneous, a fact which is as true of work in the field of eco- 
nomic statistics as in any other. 

DETERMINATION OF THE MODAL VALUE FROM MEAN 
AND MEDIAN 

Another method of securing an approximate value for
the mode, a method based upon the relationship between
the values of the mean, median and mode, may be
employed in certain cases. In a perfectly symmetrical
distribution mean, median and mode coincide. As the
distribution departs from symmetry these three points on the scale
are pulled apart. If the degree of asymmetry is only
moderate the three points have a fairly constant relation. The
mode and mean lie farthest apart, with the median one
third of the distance from the mean towards the mode. If
the asymmetry is marked, no such relationship may
prevail. Having the values of any two of the averages in a
moderately asymmetrical frequency distribution, therefore,
the other may be approximated. In fact, however, the
method should only be employed in determining the value
of the mode, as the other two values may be computed
more accurately by other methods. The value of the mode
itself should only be determined in this way when more
exact methods are not applicable or are not called for.

The following formula is based upon this relationship:

\[ Mo = \text{Mean} - 3(\text{Mean} - Md). \]

Applying this formula to the telephone pole data shown
in Table 12, the following result is secured:

\[ Mo = 9.33 - 3(9.33 - 9.015) = 8.385. \]

This value is slightly below the mid-value of the modal
class, 8.5, and is also less than the value 8.49 which is
secured by weighting within the modal group (using four classes
on each side).
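The relation between the three averages may be put into a brief Python sketch. The function name is an assumption of the illustration, and the figures are the mean (9.33) and median (9.015) of the telephone pole data quoted above; the result is only as trustworthy as the assumption of moderate asymmetry.

```python
def mode_from_mean_and_median(mean, median):
    """Approximate mode of a moderately asymmetrical distribution, using
    the rule that the median lies one third of the way from the mean
    towards the mode: Mo = Mean - 3 (Mean - Md)."""
    return mean - 3 * (mean - median)


approx_mode = mode_from_mean_and_median(9.33, 9.015)
print(round(approx_mode, 3))

# In a perfectly symmetrical distribution the three averages coincide.
print(mode_from_mean_and_median(5.0, 5.0))
```

The first value simply reproduces the arithmetic 9.33 - 3(0.315); the second shows that the formula leaves a symmetrical distribution unchanged.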

It must be emphasized that there is a fictitious accuracy
to all these values for the mode. All the methods of locat- 
ing the mode which have been discussed are merely approx- 
imative, a fact not to be forgotten in interpreting and uti- 
lizing the results. 

Graphic Location of Mode, Median, Quartiles, and Deciles

A better understanding of the frequency curve and of
the cumulative frequency curve may be secured through a
brief discussion of certain methods of locating graphically
some of the statistical measures that have been described.

The value of the mode may be readily determined from
a frequency curve of the usual type, for, by definition, the
mode is the reading on the horizontal scale corresponding
to the maximum ordinate of such a curve. If this reading
be taken from the frequency polygon a rough value will be
obtained, the mid-value of the class of greatest frequency.
A closer approximation to the true value of the mode will
be secured from a curve which has been smoothed, either
by inspection or by mathematical methods. Figure 47,
showing a curve (smoothed by inspection) based upon the
wage data presented in Table 8, indicates how the mode
may be located graphically. The horizontal reading
corresponding to the maximum ordinate of this curve is $27.50,
an approximate value of the mode which may be compared
with the values of $27.69 secured by the weighting process
and of $27.3470 secured from the values of the mean and
median.


The locations of the median and mean have been indi- 
cated on this chart. It has been pointed out that in mod- 
erately asymmetrical (or skewed) distributions there tends 
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Pig. 47. — Distribution of Weekly Earnings of Employees. A Smoothed 
Frequency Curve, showing the Relation between Mean, Median and Mode 

to be a constant relationship between the three averages 
which have been described, the median lying between the 
mean and the mode, and approximately one third of the 
distance from the former towards the latter. In the present 
case this relationship holds fairly well when the value of 
the mode is approximated from the smoothed curve. The 
irregularities in the original data render the process of 
smoothing by inspection rather arbitrary, however. 

In Fig. 48 the same data are represented by a cumulative
frequency curve, based upon Table 28 on page 124. The steepness
of a cumulative frequency curve within any given interval
depends upon the number of cases added within the corresponding
interval on the horizontal scale. Thus the curve
rises gradually at first, then more steeply, and tails off
gradually at the upper extremity. The value of the mode,
obviously, is the reading on the horizontal scale corresponding
to the point of greatest steepness. This is the point at
which the increase of frequencies is greatest, the point of
greatest concentration in the frequency distribution. The
value of the mode may be approximated from a smoothed
cumulative frequency curve by locating the point at which the
slope is greatest (which is a point of inflection) and taking the
corresponding reading on the x-scale. In the present case a value of
approximately $27.50 is secured for the mode by this method.

Fig. 48. Cumulative Distribution of Weekly Earnings of Employees,
Illustrating the Graphic Location of Median and Quartiles

Values for the median, quartiles, and deciles may also be
secured graphically from the cumulative frequency curve.
The smoothing of such a curve provides a quite satisfactory
method of interpolation and, if the scale of the diagram
is sufficiently large, accurate values may be obtained by
this method. Locate on the vertical scale (the scale of
cumulative frequencies) a point distant from the base by
N/2. If from this point a horizontal line be extended to the
cumulative curve, the abscissa of the point of intersection will be
the value of the median. This value may be easily determined
by dropping a vertical line from the point of intersection
to the x-axis. Figure 48 illustrates the application
of this method. A value of $27.125 is secured for the median
by this method. By direct interpolation a value of $27.1458
is obtained. The quartiles may be located in precisely the
same way, the vertical scale being divided into quarters
and horizontal lines extended to the cumulative curve from
the points thus located on the vertical scale.

Table 28

Cumulative Distribution of Wage-Earners in a Manufacturing
Establishment

(Classified on the basis of weekly earnings)

Weekly earnings        Number earning less than stated amount
                       (cumulative frequency)

Less than $22.50             0
Less than  23.00             1
Less than  23.50             5
Less than  24.00             8
Less than  24.50            19
Less than  25.00            29
Less than  25.50            41
Less than  26.00            56
Less than  26.50            78
Less than  27.00            98
Less than  27.50           122
Less than  28.00           152
Less than  28.50           169
Less than  29.00           186
Less than  29.50           193
Less than  30.00           199
Less than  30.50           204
Less than  31.00           208
Less than  31.50           209
Less than  32.00           209
Less than  32.50           210
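The direct interpolation mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `value_at` and the use of straight-line interpolation between the tabulated points of Table 28 are assumptions made here, standing in for the smoothed curve of Fig. 48.

```python
from bisect import bisect_left

# Table 28: upper limits in dollars ("less than ...") and cumulative counts.
LIMITS = [22.50 + 0.50 * k for k in range(21)]
CUMULATIVE = [0, 1, 5, 8, 19, 29, 41, 56, 78, 98, 122,
              152, 169, 186, 193, 199, 204, 208, 209, 209, 210]


def value_at(target, limits=LIMITS, cumulative=CUMULATIVE):
    """Return the abscissa at which the cumulative polygon reaches `target` cases."""
    if not 0 <= target <= cumulative[-1]:
        raise ValueError("target must lie between 0 and N")
    hi = bisect_left(cumulative, target)  # first tabulated point with count >= target
    if hi == 0:
        return limits[0]
    lo = hi - 1
    # Share of the interval that must be covered to reach the target count.
    share = (target - cumulative[lo]) / (cumulative[hi] - cumulative[lo])
    return limits[lo] + share * (limits[hi] - limits[lo])


N = CUMULATIVE[-1]
median = value_at(N / 2)            # point distant N/2 from the base
q1 = value_at(N / 4)                # first quartile
q3 = value_at(3 * N / 4)            # third quartile
print(round(median, 4), round(q1, 4), round(q3, 4))
```

The median obtained in this way, 27.1458, agrees with the value quoted for direct interpolation; the quartiles follow from the same function by changing the target count.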
For some purposes, particularly those that involve the
averaging of rates or ratios rather than quantities, none of
the averages which have been described is suitable. The
geometric and the harmonic means are types of averages
that should be familiar because they are particularly
appropriate for such purposes.

The Geometric Mean

The geometric mean is the nth root of the product of n
measures; its value thus is represented by:

$$M_g = \sqrt[n]{a_1 \cdot a_2 \cdot a_3 \cdots a_n}$$

The geometric mean of the numbers 2, 4, 8, is

$$M_g = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4.$$

It is obvious from the method of computation that if 
any one of the measures in the series has a value of zero the 
geometric mean is zero. 

The actual computation of the geometric mean is greatly
facilitated by the use of logarithms. In this form

$$\log M_g = \frac{\log a_1 + \log a_2 + \log a_3 + \cdots + \log a_n}{n}$$

The logarithm of the geometric mean is equal to the arith- 
metic mean of the logarithms of the individual measures. 
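A minimal Python sketch of the logarithmic computation follows. The function name `geometric_mean` and the handling of zero and negative inputs are choices made for this illustration; common logarithms are used, as in the text.

```python
import math


def geometric_mean(values):
    """Geometric mean computed as the antilog of the mean of the logarithms."""
    if not values:
        raise ValueError("at least one measure is required")
    if any(v < 0 for v in values):
        raise ValueError("the geometric mean is not defined here for negative measures")
    if any(v == 0 for v in values):
        return 0.0  # a single zero makes the whole product, and hence the mean, zero
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


print(geometric_mean([2, 4, 8]))
```

For the numbers 2, 4, 8 the result agrees, up to floating-point rounding, with the value 4 found from the cube root.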

When the measures, of which the geometric mean is desired,
are to be weighted, the separate weights are introduced
as exponents of the terms to which they apply. Thus
if we represent the sum of the weights by N and the weights
corresponding to the terms a_1, a_2, a_3 . . . a_n, respectively,
by w_1, w_2, w_3 . . . w_n, the formula for the geometric mean is

$$M_g = \sqrt[N]{a_1^{w_1} \cdot a_2^{w_2} \cdot a_3^{w_3} \cdots a_n^{w_n}}.$$

This is equivalent to repeating each term a number of
times, the number corresponding to the amount by which
it is weighted. (This, of course, is precisely what is done
in securing a weighted arithmetic mean.) When logarithms
are employed the formula for the weighted geometric mean
becomes

$$\log M_g = \frac{w_1 \log a_1 + w_2 \log a_2 + w_3 \log a_3 + \cdots + w_n \log a_n}{N}$$

A method of computing the geometric mean may be 
illustrated with reference to Table 29, which shows the 
distribution of the prices of 66 preferred stocks paying
seven per cent dividends. The table is based upon closing 
prices on the New York Stock Exchange and the New York 
Curb Exchange for the week ended July 25, 1936. 

Table 29

Computation of the Geometric Mean of Preferred Stock Prices

Class-interval       m       f      log m       f log m

$ 70-$ 89.9          80      5     1.90309      9.51545
  90- 109.9         100     20     2.00000     40.00000
 110- 129.9         120     27     2.07918     56.13786
 130- 149.9         140      6     2.14613     12.87678
 150- 169.9         160      8     2.20412     17.63296

                            66                136.16305

$$\log M_g = \frac{136.16305}{66} = 2.06308$$

$$M_g = 115.63$$
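The computation of Table 29 may be retraced with a short Python sketch of the weighted formula. The names `MIDPOINTS`, `FREQUENCIES` and `weighted_geometric_mean` are introduced here for illustration; the class frequencies serve as the weights and the class mid-points as the terms.

```python
import math

# Table 29: class mid-points (m) and frequencies (f).
MIDPOINTS = [80, 100, 120, 140, 160]
FREQUENCIES = [5, 20, 27, 6, 8]


def weighted_geometric_mean(values, weights):
    """Antilog of the weighted arithmetic mean of the common logarithms."""
    total_weight = sum(weights)                      # N, the sum of the weights
    sum_w_log = sum(w * math.log10(v) for v, w in zip(values, weights))
    return 10 ** (sum_w_log / total_weight)


mg = weighted_geometric_mean(MIDPOINTS, FREQUENCIES)
print(round(mg, 2))
```

Carried to two decimals the sketch gives 115.63, the value of the geometric mean secured from Table 29.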


CHARACTERISTICS OF THE GEOMETRIC MEAN 

The nature of the geometric mean may be understood
by considering its relation to the terms it represents, as an
average.

If the arithmetic mean of a series of measures replace
each item in the series, the sum of the measures will remain
unchanged. Thus, the sum of the numbers 2, 4, 8 is 14.
The arithmetic mean of these three numbers is 4 2/3; if this
value be inserted in the place of each of the three measures
the sum remains 14. It is characteristic of the geometric
mean that the product of a series of measures will remain
unchanged if the geometric mean of those measures replace
each item in the series. Thus the product of 2, 4, 8 is 64.
The geometric mean of the three numbers is 4; if this value
replace each of the three measures the product remains 64.

Again, it is true of the arithmetic mean that the sum of
the deviations of the items above the mean equals the
sum of the deviations of the items below the mean
(disregarding signs). The sums of the differences between the
individual items and the mean are equal. In the case of
the geometric mean the products of the corresponding ratios
are equal. If the ratios of the geometric mean to the measures
which it exceeds be multiplied together, the product
will equal that secured by multiplying together the ratios
to the geometric mean of the measures exceeding it in value.
For example, the geometric mean of the numbers 3, 6, 8, 9
is 6. The following equation may be set up:

$$\frac{6}{3} \times \frac{6}{6} = \frac{8}{6} \times \frac{9}{6}$$

The last example brings out the most important characteristic
of the geometric mean. It is a means of averaging
ratios. Its chief use in the field of economic statistics has
been in connection with index numbers of prices, where
rates of change are of major importance. A rise in prices
represented by the change from 50 to 100 is as important
as a rise from 100 to 200. Yet this equivalence is not brought
out by the arithmetic mean, which gives double weight
to the change which involves an absolute difference of 100.
An example frequently cited is that of two cases of price
change, one a ten-fold increase, from 100 to 1,000, the other
a fall to one tenth of the old price, from 100 to 10. The
arithmetic mean of 1,000 and 10 is 505, the geometric
mean is √(1,000 × 10), or 100. When the average is of the
latter type it is seen that the two equal ratios of change
have balanced each other. The arithmetic mean, 505, is
quite incorrect as a measure of average ratio of price change.
This subject is discussed at greater length in the chapter on
index numbers.
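The contrast between the two averages in the price example may be checked with a brief Python sketch. The helper names are hypothetical, and both prices are taken as starting from a base of 100, as in the text.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # nth root of the product of the n measures
    return math.prod(values) ** (1 / len(values))


# Two prices that both stood at 100: one rose ten-fold, one fell to a tenth.
new_prices = [1000, 10]

am = arithmetic_mean(new_prices)   # suggests a five-fold average rise
gm = geometric_mean(new_prices)    # returns the base of 100: the ratios cancel
print(am, gm)
```

The geometric mean returns the original level of 100, showing that the two equal ratios of change have balanced each other, while the arithmetic mean of 505 does not.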

What has been said in an earlier section in regard to the 
advantages of logarithmic charting for certain purposes 
bears upon the use of the geometric mean. This average 
is sometimes called the logarithmic mean, as its logarithm 
is simply the arithmetic mean of the logarithms of the 
constituent measures. Wherever percentages of change are 
being averaged, where ratios rather than absolute differences
are significant, the use of the geometric mean is
advisable.

A problem involving the use of the geometric mean arises
in computing the average rate of increase of any sum at
compound interest. If p_0 represent the principal at the
beginning of the period, p_n the principal at the end of the
period, r the rate of interest and n the number of years
in the period, the sum to which p_0 will amount at the end
of the n years, if interest is compounded annually, is
represented by the equation:

$$p_n = p_0 (1 + r)^n$$

It follows from this that:

$$r = \sqrt[n]{\frac{p_n}{p_0}} - 1$$

Thus, if $1,000 at compound interest amounts to $1,600
at the end of 12 years, there has been an increase of 60 per
cent. The arithmetic mean is 5 per cent, but this is not the
rate at which the money increased. The true rate is:

$$r = \sqrt[12]{\frac{1,600}{1,000}} - 1 = \sqrt[12]{1.60} - 1 = 1.04 - 1 = .04, \text{ or } 4\%$$

Precisely the same problem arises whenever rates of
increase or decrease are to be averaged. The use of the
arithmetic mean gives an incorrect result.
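The compound-interest example may be verified with a small Python sketch. The function name `average_compound_rate` is an assumption of this illustration; annual compounding is assumed, as in the text.

```python
def average_compound_rate(p0, pn, years):
    """Annual rate r satisfying pn = p0 * (1 + r) ** years."""
    return (pn / p0) ** (1 / years) - 1


p0, pn, years = 1000.0, 1600.0, 12

true_rate = average_compound_rate(p0, pn, years)   # about .04
naive_rate = (pn / p0 - 1) / years                  # total increase spread evenly

# Compounding at the true rate reproduces the final principal;
# compounding at the naive rate overshoots it.
print(round(true_rate, 4), round(naive_rate, 4))
print(round(p0 * (1 + true_rate) ** years, 2), round(p0 * (1 + naive_rate) ** years, 2))
```

The sketch shows why the simple division of the total percentage increase by the number of years overstates the rate: applied with compounding, it would carry the principal well beyond $1,600.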

THE GEOMETRIC MEAN AS A MEASURE OF CENTRAL
TENDENCY

A question arises as to the type of frequency distribution
the central tendency of which would be best represented
by the geometric mean. When the absolute measures,
plotted on the arithmetic scale, give a fairly symmetrical
distribution, the arithmetic mean is clearly preferable to the
geometric mean. But when the absolute figures thus plotted
give an asymmetrical frequency curve of such a type that
the asymmetry would be removed and a symmetrical curve
secured by plotting the logarithms of the measures, the
geometric mean would appear to be preferable. Such a
distribution would be one in which not the absolute deviations
about the central tendency but the relative deviations,
the deviations as ratios, were symmetrical. The arithmetic
mean of the logarithms of the various measures (which
value is, as has been shown, the logarithm of the geometric
mean of the original measures) would be the best representative
of the central tendency in such a distribution. The
curve thus plotted would be symmetrical about the logarithm
of the geometric mean. A frequency curve representing
the logarithms of percentage changes in prices would tend
to show this symmetry about the logarithm of the geometric
mean of these changes. These percentage changes, as natural
numbers, group themselves in an asymmetrical form,
with the range of deviations above the arithmetic mean
greatly exceeding the range below.¹ This arises, of course,
from the fact that prices of given commodities may increase
1,000 per cent or more from a given base, but cannot fall
more than 100 per cent from any given base. The section
on index numbers contains a fuller discussion of this particular
phase of the subject.²

¹ Cf. Fig. 61.

The construction of a frequency distribution in which loga- 
rithms are tabulated would be laborious, if the logarithm 
of each item to be entered had to be determined, before 
tabulation. It is possible, however, with no great trouble to 
construct a true logarithmic distribution, with class-interval 
constant in terms of logarithms. The 66 quotations on pre- 
ferred stocks, tabulated in Table 29, range from 74 to 166. 
The logarithm of 74 is 1.86923; the logarithm of 166 is 
2.22011. The range, in logarithms, is .35088. We may 
select .06 as a suitable logarithmic class-interval, for the 
present purpose. For convenience in tabulating the data 
we set up two series of class limits, one in terms of logarithms, 
one in terms of the corresponding natural numbers. In 
constructing the distribution natural numbers may be tab- 
ulated, utilizing the class limits defined in natural terms. 
All subsequent calculations may be carried through in terms
of logarithms. The distribution appears in Table 30 on 
page 131. 
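The two series of class limits may be set up mechanically, as the following Python sketch suggests. The starting point of 1.85 and the count of seven classes are read from Table 30; the function name `log_class_limits` is hypothetical, and the limits are held as whole hundredths to avoid accumulating rounding error.

```python
LOG_START_HUNDREDTHS = 185   # 1.85, just below log 74 = 1.86923
LOG_WIDTH_HUNDREDTHS = 6     # logarithmic class-interval of .06
CLASS_COUNT = 7


def log_class_limits():
    """Return (log lower, log upper, natural lower, natural upper) for each class."""
    rows = []
    for k in range(CLASS_COUNT):
        low = (LOG_START_HUNDREDTHS + LOG_WIDTH_HUNDREDTHS * k) / 100
        high = (LOG_START_HUNDREDTHS + LOG_WIDTH_HUNDREDTHS * (k + 1)) / 100
        rows.append((low, high, 10 ** low, 10 ** high))
    return rows


for low, high, nat_low, nat_high in log_class_limits():
    print(f"{low:.2f}-{high:.2f}   {nat_low:7.2f}-{nat_high:7.2f}")
```

Each class has the same width in logarithms, while the corresponding limits in natural numbers widen steadily from class to class; the printed boundaries are exact class boundaries, so they may differ by a cent from tabulated limits that are rounded to keep adjacent classes from overlapping.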

² C. M. Walsh, in The Problem of Estimation (London, P. S. King & Son,
1921), 35, lays down the following criteria for the use of averages:

(a) When there are no conceivable or assignable upper or lower limits to the
    values of the terms in a series, the arithmetic average should be
    employed.

(b) When there is a definite lower limit at or above zero and no upper
    conceivable or assignable limit, the geometric average should be employed.
    Because this is true of price changes Walsh believes the geometric
    average to be the correct one to use in making index numbers of prices.

(c) When in practice, or in the nature of things, certain upper and lower limits
    are found to exist and the above criteria cannot be employed, a study of
    the actual dispersion of the data is necessary. In this case, if the mode is
    found nearer to the arithmetic average, that average should be
    employed; if the mode is found nearer to the geometric average, that
    average should be used.

Table 30

Distribution of Prices of Preferred Stocks
Paying Seven Per Cent Dividends

Class-interval        Class-interval     Mid-point       Frequency
(natural numbers)     (logarithms)       (logarithms)
                                              m               f           fm

$ 70.80-$ 81.27       1.85-1.9099           1.88              2           3.76
  81.28-  93.32       1.91-1.9699           1.94              4           7.76
  93.33- 107.15       1.97-2.0299           2.00             12          24.00
 107.16- 123.02       2.03-2.0899           2.06             30          61.80
 123.03- 141.24       2.09-2.1499           2.12              6          12.72
 141.25- 162.17       2.15-2.2099           2.18              7          15.26
 162.18- 186.20       2.21-2.2699           2.24              5          11.20

                                                             66         136.50

If the geometric mean is considered appropriate for a given
series, the type of distribution represented by Table 30 is
more logical than that shown in Table 29, and the descriptive
measurements secured from Table 30 have correspondingly
greater validity. We may derive the mean of the
logarithms of the preferred stock prices by dividing Σfm
of Table 30 (136.50) by 66. The value is 2.06818. The
anti-log of this is 116.97, which is the geometric mean of
the distribution. This differs somewhat from the value
$115.63 secured from Table 29. The difference is due, in
part, to the use of different class-intervals and class limits
in the two cases. With a relatively small number of observations
such differences would be expected to lead to different
results. Differing assumptions concerning the internal
distribution of items within the several classes would also
contribute to a discrepancy between the two results. The
value obtained from Table 30 is probably a closer approximation
to the actual geometric mean than that obtained from
Table 29.

A frequency curve based upon the logarithms of the
measures included rather than upon the natural numbers,
has been employed to advantage in plotting data relating
to income distribution. When natural numbers are plotted,
the range of income distribution is so large that it is physically
impossible to prepare a chart that will reveal the characteristic
features of all sections of the curve. The process
of plotting on double logarithmic paper (which is, of course,
equivalent to plotting the logarithms of both x's and y's)
meets this difficulty, giving a true impression of the whole
distribution and the relations between its parts, and, at the
same time, brings out certain important features that are
obscured in the natural scale chart. In particular, this
device appears to smooth into a straight line that part of
the curve lying above the mode, a fact which led Vilfredo
Pareto to enunciate what has been known as Pareto’s Law
concerning income distribution. An intensive study of the
distribution of income in the United States has led the staff
of the National Bureau of Economic Research to call into
question certain conclusions drawn from Pareto’s generalizations,
though the value of the double logarithmic scale for
the presentation of income data has been recognized.

The Harmonic Mean

The harmonic mean is a type of average capable of
application only within a restricted field, but which should
be employed to avoid error in handling certain types of
data. It must be used in the averaging of time rates and
it has distinctive advantages in the manipulation of some
types of price data. The following example will illustrate
the method of employing this average.

A given commodity is priced, in three different stores, at
“four for a dollar,” “five for a dollar” and “twenty for a
dollar.” The average price per unit is required. The arithmetic
average of the figures given (4, 5, and 20) is 9 2/3. If
we take this to be the average number sold per dollar, the
average price would appear to be $1.00 ÷ 9 2/3, or 10 10/29 cents
each. But the original quotations are equivalent to unit
prices of 25 cents, 20 cents, and 5 cents; the arithmetic
average of these prices is 16 2/3 cents apiece. The discrepancy
between 10 10/29 cents and 16 2/3 cents is due to a faulty use
of the arithmetic mean in averaging quotations in the “so
many per dollar” form. Such a mean is, in effect, a weighted
average, with greater weight being given to quotations
involving a larger number of commodity units.

The correct result may be secured by taking the harmonic
mean of the three original quotations. The harmonic mean
of a series of numbers is the reciprocal of the arithmetic mean
of the reciprocals of the individual numbers. Thus if we
represent the numbers to be averaged by r_1, r_2, r_3 . . . r_n, the formula
for the harmonic mean, H, is

$$\frac{1}{H} = \frac{\dfrac{1}{r_1} + \dfrac{1}{r_2} + \dfrac{1}{r_3} + \cdots + \dfrac{1}{r_n}}{N}$$

Using the figures just quoted:

$$\frac{1}{H} = \frac{\dfrac{1}{4} + \dfrac{1}{5} + \dfrac{1}{20}}{3} = \frac{10}{60} = \frac{1}{6}$$

The harmonic mean of 4, 5, and 20 is 6, the average number
of units sold per dollar. The average price per unit is
16 2/3 cents.
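The store example may be followed through exactly with rational arithmetic. The sketch below is illustrative; the function name `harmonic_mean` is a choice made here, and Python's `fractions` module is used so that the thirds and twenty-ninths appear without rounding.

```python
from fractions import Fraction


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    values = [Fraction(v) for v in values]
    return len(values) / sum(1 / v for v in values)


units_per_dollar = [4, 5, 20]

h = harmonic_mean(units_per_dollar)                 # units per dollar
price_from_h = Fraction(100) / h                    # cents per unit

# Faulty route: arithmetic mean of the "so many per dollar" quotations.
am = Fraction(sum(units_per_dollar), len(units_per_dollar))
price_from_am = Fraction(100) / am                  # cents per unit

# Direct route: arithmetic mean of the unit prices 25, 20 and 5 cents.
direct = sum(Fraction(100, u) for u in units_per_dollar) / len(units_per_dollar)

print(h, price_from_h, price_from_am, direct)
```

The price derived from the harmonic mean coincides with the direct average of the unit prices, 50/3 or 16 2/3 cents, while the arithmetic mean of the quotations leads to 300/29, or 10 10/29 cents.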

The computation of the harmonic mean of a series of
magnitudes is greatly facilitated by the use of prepared
tables of reciprocals.¹

¹ Barlow's Tables of Squares, Cubes, Square Roots, Cube Roots and Reciprocals,
New York, Spon and Chamberlain.

Relations between Different Averages

When different averages are located or computed for a
given series of magnitudes, certain relationships between
them are found to prevail.

1. The arithmetic mean, the median and the mode coincide in
   a symmetrical distribution.

2. In a moderately asymmetrical distribution the median lies
   between the mean and the mode, approximately one third
   of the distance along the scale from the former towards the
   latter. Hence, for this type of distribution there is an
   approximation to the following relationship:

   Mo = M - 3(M - Md).

3. The arithmetic mean of any series of magnitudes is greater
   than their geometric mean.

4. The geometric mean of any series of magnitudes is greater
   than their harmonic mean. The only exception to the last
   two rules is found when all the measures in the series are
   equal, in which case arithmetic mean, geometric mean and
   harmonic mean are equal.

5. The geometric mean of any two terms is equal to the geometric
   mean of the harmonic and arithmetic means of those terms.
   Thus if the terms be 2 and 8, the harmonic mean is 3 1/5, the
   geometric mean 4, and the arithmetic mean 5. But 4 is also
   the geometric mean of 3 1/5 and 5. This relationship does not
   hold when the series includes more than two terms, unless
   the terms constitute a geometric series.

6. When the dispersion of data follows the arithmetic law, the
   mode and median will generally be found closer to the
   arithmetic than to the geometric average. When the dispersion
   follows the geometric law the mode and median will
   generally be found closer to the geometric than to the
   arithmetic average.
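Relations 3, 4 and 5 may be checked numerically. The sketch below relies on the `mean`, `geometric_mean` and `harmonic_mean` functions of Python's standard `statistics` module; the sample series are those used earlier in the chapter, and positive magnitudes are assumed throughout.

```python
import math
import statistics as st


def three_means(values):
    """Return (arithmetic, geometric, harmonic) means of positive values."""
    return st.mean(values), st.geometric_mean(values), st.harmonic_mean(values)


# Relations 3 and 4: A > G > H unless all measures are equal.
a, g, h = three_means([3, 6, 8, 9])
print(a, g, h)

# Equal measures: the three means coincide.
a_eq, g_eq, h_eq = three_means([7, 7, 7])

# Relation 5: for two terms, G is the geometric mean of H and A.
a2, g2, h2 = three_means([2, 8])
print(a2, g2, h2, math.sqrt(h2 * a2))
```

For the terms 2 and 8 the harmonic mean is 3.2 and the arithmetic mean 5, and the square root of their product is again 4, the geometric mean of the two terms.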

Characteristic Features of the Chief Averages
The arithmetic mean

1. The value of the arithmetic mean is affected by every measure
   in the series. For certain purposes it is too much affected by
   extreme deviations from the average.

2. The arithmetic mean is easily calculated, and is determinate 

in every case. 

3. The arithmetic mean is a computed average, and hence is 

capable of algebraic manipulation. 

The median 

1. The value of the median is not affected by the magnitude of 

extreme deviations from the average. 

2. The median may be located when the items in a series are not 

capable of quantitative measurement. 

3. The median may be located when the data are incomplete,
   provided that the number and general location of all the
   cases be known, and that accurate information be available
   concerning the measures near the center of the distribution.

4. The median is not as well adapted to algebraic manipulation
   as the arithmetic, geometric and harmonic means.

The mode 

1. The value of the mode is not affected by the magnitude of 

extreme deviations from the average. 

2. The approximate mode is easy to locate but the determination 

of the true mode requires extended calculation. 

3. The mode has no significance unless the distribution includes a 

large number of measures and possesses a distinct central 
tendency. 

4. The mode is the average most typical of the distribution, 

being located at the point of greatest concentration. 

5. The mode is not capable of algebraic manipulation. 

The geometric mean 

1. The geometric mean gives less weight to extreme deviations 

than does the arithmetic mean. 

2. It is strictly determinate in averaging positive values. 

3. The geometric mean is the form of average to be used when rates 

of change or ratios between measures are to be averaged, 
as equal weight is given to equal ratios of change. It is par- 
ticularly well adapted to the averaging of ratios of price 
change. 

4. The geometric mean is capable of algebraic manipulation. 

The harmonic mean 

1. The harmonic mean is adapted to the averaging of time rates 

and certain similar terms. It has been employed in the 
field of economic statistics in the manipulation of price data. 

2. The labor of computing the harmonic mean and its unfamiliarity
   detract from its usefulness in ordinary statistical analysis.

3. The harmonic mean is capable of algebraic manipulation. 

This summary has been designed to show that each 
type of average has its own particular field of usefulness. 
Each one is best for certain purposes and under certain 
conditions. The characteristics and limitations of each one 
should be understood in order that it may be appropriately 
employed. A complete description of a frequency distribution
frequently calls for the determination of two or three
of the chief averages, as well as other statistical measure- 
ments. The arithmetic mean is perhaps the most useful 
single average. The simplicity of its computation, the 
possibility of employing it in algebraic calculations and 
the fact that its meaning is perfectly definite and familiar 
make it highly serviceable in statistical work. Its sphere 
of usefulness is not universal, however, and it should only 
be employed when the given conditions render it suitable. 
A fuller appreciation of the distinctive virtues of the
geometric mean is leading to a wider employment of that
measure in many types of statistical work. A discriminating
use of averages is essential to sound statistical analysis.

CHAPTER V


DESCRIPTION OF THE FREQUENCY 
DISTRIBUTION; MEASURES OF 
VARIATION AND SKEWNESS 

In the preceding chapters we have been concerned, first, 
with methods of reducing a mass of quantitative data to a 
form in which the characteristics of the mass as a whole 
may be readily determined and, in the second place, with 
methods of describing the assembled data. The first ob- 
ject is accomplished with the formation of a frequency 
distribution. The second is partially accomplished when 
there has been obtained a single significant value in the 
form of an average which represents the central tendency 
of the distribution. But any average, by itself, fails to give 
a complete description of a frequency distribution. Three 
other values are needed before the chief characteristics of a 
given distribution have been measured, and comparison 
with other distributions is possible. The first of these is a 
measure of the degree to which the items included in the 
original distribution depart or vary from the central value, 
the degree of “scatter,” variation or dispersion. The second
is a measure of the degree of symmetry of the distribution,
of the balance or lack of balance on the two sides of the
central value. The third is a measure of kurtosis, of the
degree to which there is a bunching of cases at the modal
value. The present chapter deals with various measures of 
variation and skewness. The method of measuring kurtosis 
is referred to at a later point. 

Nature and Significance of Variation

The fact of variation in collections of quantitative data
has been pointed out in earlier sections and the bearing of
this fact upon the work of the statistician indicated. Practically
every collection of quantitative data, consisting of
measurements from the social, biological, or economic field, 
is characterized by variation, by quantitative differences 
among the individual units. And this fact of variation is 
as important as the fact of family resemblance. Biological 
variation has been a fundamental factor in the evolutionary 
process. No measurement of a physical characteristic of a 
racial group, such as height, is complete without an ac- 
companying measure of the average variation in the group 
in this respect. The average income in a country is perhaps 
of less significance than the variation in income, the differ- 
ences between the incomes received by different economic 
classes. Price variations interrupt the normal functioning 
of the economic system, causing hardship to some and 
giving unearned profits to others, because the various ele- 
ments in the price system are unequally affected. Not 
changes in the general level of prices but differences among 
changes in the prices of individual commodities and services 
cause trouble. 

An average, by itself, has little significance unless the 
degree of variation in the given frequency distribution is 
known. If the variation is so great that there is no pro- 
nounced central tendency an average has no significance. 
With a decrease in the degree of variation an average 
becomes increasingly significant. Whether a single frequency
distribution is being described, therefore, or comparison
is being made with other distributions, a measure of
central tendency must be supplemented by a measure of 
variation. 


Measures of Absolute Variation

Variation may be expressed in terms of the units of
measurement employed for the original data, or may be
expressed as an abstract figure, such as a percentage, which
is independent of the original units. When the original
units are employed absolute variability is measured; when
an abstract figure is secured we have a measure of relative
variability, more suitable for comparison than the former
type. Measures of absolute variability are first considered.

THE RANGE

A rough measure of variation is afforded by the range,
which is the absolute difference between the value of the
smallest item and the value of the greatest item included
in the distribution. Table 20 in Chapter IV shows the distribution
of London-New York monthly exchange rates
during the period 1882-1913. The smallest item among the
original figures included in the table is $4.83; the greatest
is $4.908. The range, therefore, is $4.908 - $4.83, or $.078.
A distance on the scale equal to $.078 will include every
item. If the original data were not to be had the range
could be approximated from the frequency table. It would
be the difference between the lower limit of the class at the
lower extreme of the distribution, and the upper limit of
the class at the upper extreme, or $.085 in the present
case.

The value of the range, it is obvious, depends upon the 
values of the two extreme cases only. A single abnormal 
item would change its value materially. Because it is
erratic and is likely to be unrepresentative of the true 
distribution of items, it is seldom used in statistical work. 
The range is frequently employed as a measure of stock 
market fluctuations, though its adequacy for this purpose 
may be questioned. 

THE MEAN DEVIATION

A more accurate measure of the dispersion of items about
a central value is afforded by the simple device of measuring
the deviation of each item from this central value and averaging
these deviations. The simple example in Table 31
illustrates the method of computation:

Table 31

Computation of Mean Deviation

  m      f      d

  3      1      6
  6      1      3
  9      1      0
 12      1      3
 15      1      6

         5     18

M.D. = 18/5 = 3.6

The average (the mean and median coincide in this case) 
is 9. The deviations are added, taking no account of alge- 
braic signs, and the total divided by the number of items. 
This procedure is described by the expression

$$M.D. = \frac{\Sigma |d|}{N}$$

In general terms, the mean deviation of a series of mag- 
nitudes is the arithmetic mean of their deviations from an 
average value (either mean or median). In the process of 
summation and averaging the algebraic signs of the devi- 
ations are disregarded. In practice it makes little differ- 
ence whether deviations be measured from the mean or 
the median. Theoretically the latter should be chosen, for 
the value of the mean deviation is least when the median 
is the point of reference. 
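For ungrouped items the definition may be sketched in Python as follows. The function name `mean_deviation` and its optional `center` argument are assumptions of this illustration; when no center is given the median is used as the point of reference.

```python
import statistics as st


def mean_deviation(values, center=None):
    """Arithmetic mean of the absolute deviations from `center` (median by default)."""
    if center is None:
        center = st.median(values)
    return sum(abs(v - center) for v in values) / len(values)


# Table 31: the mean and median coincide at 9.
items = [3, 6, 9, 12, 15]
print(mean_deviation(items))

# A skewed series (hypothetical figures): the median gives the smaller value.
skewed = [1, 2, 3, 4, 100]
about_median = mean_deviation(skewed)
about_mean = mean_deviation(skewed, center=st.mean(skewed))
print(about_median, about_mean)
```

For the items of Table 31 the result is 3.6. In the skewed series the mean deviation about the median (20.2) is less than that about the mean (31.2), in keeping with the statement that the mean deviation is least when the median is the point of reference.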

Table 32 illustrates the computation of the mean deviation
when the data are grouped in a frequency distribution.¹
In this work, as in certain other computations, we
make the assumption that the items in each class-interval 
are uniformly distributed throughout that interval. 

The median hourly wage of the 4,216 steel workers repre- 
sented in this distribution is 48. 11 cents. The mean devia- 


¹ Since the uses of the mean deviation are somewhat limited, the beginning
student may well omit the remainder of the section on the mean deviation.
After study of the more widely employed standard deviation the student may
wish to return to the computation of the mean deviation of observations
grouped in a frequency distribution.



Table 32. CompvMion of Mean Deviation 
Average Hourly Earnings of Workers in Open-Hearth Furnaces 
in the Great Lakes and Middle West District in 1933 


Class-interval 
{in cents per 
hour) 

Mid- 

poira 

Fre- 

quency 

Deviation 

from 

arbitrary 





origin 




m 

f 

d 

fd 


25.0- 29 9 

27.5 

12 

20 

240 

c = 0 61 

30.0- 34.9 

32.5 

472 

15 

7,080 

(Median = 48.11 

35.0- 39.9 

37 5 

700 

10 

7,000 Arbitrary origin =*47.5 

40.0- 44.9 

42 5 

601 

5 

3,005 

c = 48.11 -47.6 = 0.61) 

45.0- 49.9 

47 6 

520 

0 

0 

50.0- 54.9 

52.5 

537 

5 

2,685 

JVo = No. of observations in 

55.0- 59.9 

57.5 

397 

10 

3,970 

classes above that 

60.0- 64.9 

62.5 

225 

15 

3,376 

containing the median 

65.0- 69.9 

67 5 

139 

20 

2,780 

= 1911 

70.0- 74.9 

72 5 

111 

25 

2,775 


75.0- 79.9 

77.5 

43 

30 

1,290 

Nb = No. of observations in 

80.0- 84.9 

82.5 

111 

35 

3,885 

classes below that 

85.0- 89.9 

87.5 

74 

40 

2,960 

containing the median 

90.0- 94.9 

92.5 

59 

45 

2,655 

= 1785 

95.0- 99.9 

97.5 

45 

50 

2,250 


100.0-104.9 

102.5 

61 

55 

2,805 Nm. « No. of observations in 

105.0-109.9 

107.5 

78 

60 

4,680 

the class-interval con- 

110.0-114.9 

112.5 

6 

65 

390 

taining the median 

115.0-119.9 

117.5 

17 

70 

1,190 

= 620 

120.0-124.9 

122.5 

1 

75 

75 


125.0-129 9 

127.5 

2 

80 

160 

1 = 6 

130.0-134.9 

132.5 

5 

85 

425 


135.0-139.9 

137.6 

7 

90 

630 

Calcviations 

140.0-144.9 

142.5 

1 

95 

95 


145.0-149,9 

147.5 

1 

100 

100 

(1) Sum of deviations from 

150.0-164.9 

152 5 

0 

105 

0 

arbitrary origin of all 

165.0-169.9 

157.6 

1 

4,216 

no 

no 

56,610 

observations in classes 
other than that contain- 
ing the median = 66,610 

Computation of median: 



N ^ 

2,108 




(2) {m - Na)c = - 76.86 

o “ 





jL 


1 323 

X (6.0) 1 



Md^ 



(8) N» 2i 


r 45.0 + 3.11 



(--«y 


= 48.11 




+ iVm — =■ 688.67 


Sum of detviations from median 

M.D. 


= 56,610 - 76.86 
+ 688.67 
« 57,221.81 
_ 57,221.81 
4,216 
« 13.673 
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tion could be computed directly, with reference to deviations 
from the median, but it is simpler to measure the deviations 
from the midpoint of the class containing the median, and 
then apply corrections to offset the resulting error. 

In Table 32 deviations have been measured not from 
48.11, the value of the median, but from 47.5, the midpoint 
of the class in which the median falls. Working with these 
measurements, the computations involve three steps: 

1. Obtaining the sum of the deviations from the assumed 
median of all items falling in classes other than that con- 
taining the true median. 

2. Correcting this sum for the error involved in the use of 
an origin other than the true median. 

3. Adding to the corrected sum the sum of the deviations 
from the median of the items within the class-interval con- 
taining the median. 

(1) The sum referred to in (1) is obtained directly, in the man- 
ner indicated in Table 32.¹ It comes to 56,610. 

(2) The four classes below that containing the median con- 
tain 1,785 items. The deviation of each of these items from the 
true median, 48.11, is greater by 0.61 than the deviations actu- 
ally recorded in Table 32, which are measured from 47.5. The 
measured deviations are too small by 0.61 for 1,785 items. The 
22 classes above that containing the median contain 1,911 items. 
For each of these the deviation from the true median, 48.11, is less 
by 0.61 than the deviations actually recorded, which are meas- 
ured from 47.5. The measured deviations are too large by 
0.61 for 1,911 items. Accordingly the figure 56,610, which we 
have obtained as the sum of the deviations from the arbitrary 
origin of all the items in classes other than that containing the 
true median, must be corrected by the addition to it, algebra- 
ically, of + (1,785 × 0.61) and - (1,911 × 0.61). 

¹ This is not the sum of the deviations from 47.5, the arbitrary origin. For 
no account is taken of the deviations from that value of the 520 items falling 
within the class in question. If these are scattered uniformly throughout the 
class-interval they do contribute to the total of the deviations from 47.5. 
This would not be so if we were working on the assumption that all the items 
in a class are concentrated at the midpoint. In computing the mean devia- 
tion, however, it is necessary to make a different assumption, namely, that of 
uniform distribution throughout the class-interval. 

The corrections under point (2) may be defined more briefly.¹ 
Let $N_a$ = number of items in classes above that containing the 
median, $N_b$ = number of items in classes below that containing 
the median, and $c = Md - O$, where $Md$ is the median and 
$O$ is the arbitrary origin. The quantity $c$ will, of course, be 
positive or negative, depending on the relative values of $Md$ 
and $O$, and this sign should be retained throughout the calcu- 
lations. The correction noted in (2) is then given by 

$$(N_b - N_a)c$$

which is to be added (algebraically) to the sum referred to in 
(1). In the present instance we have, as the required correction, 

$$(1{,}785 - 1{,}911) \times (+0.61) = -76.86.$$

(3) Taking account of point (3) now, we must measure the 
deviations from the median of the 520 observations hitherto 
neglected. These are the observations falling within the class- 
interval that contains the median. This class-interval extends, 
on the x-scale, from 45.0 to 50.0. The value of the median is 
48.11. If the 520 observations are uniformly distributed be- 
tween 45.0 and 50.0, the number falling between 45.0 and 48.11 
may be computed by the direct proportion 

$$\frac{3.11}{5.0} \times 520 = 323.4.$$

Similarly, for the number of observations between 48.11 and 
50.0, we have 

$$\frac{1.89}{5.0} \times 520 = 196.6.$$

On the assumption of uniform distribution, the average deviation 
from the median of the 323.4 observations falling between 45.0 
and 48.11 is 1.555 (i.e., 3.11/2). For the sum of the deviations 
of this group from the median, we have 

$$323.4 \times 1.555 = 502.887.$$

Similarly, the average deviation from the median of the 196.6 
observations falling between 48.11 and 50.0 is .945 (i.e., 1.89/2). 
For the sum of the deviations of this group from the median 
we have 

$$196.6 \times .945 = 185.787.$$

The sum of the deviations from the median of all the observa- 
tions in the class containing the median is 

$$502.887 + 185.787 = 688.674.$$

¹ Cf. A Handbook of Mathematical Statistics, H. L. Rietz, editor, Boston, 
Houghton Mifflin, 1924, 80. 

In more general terms, the correction noted in (3) may be 
defined as follows. We have $c = Md - O$; let $i$ = class-inter- 
val and let $N_m$ = number of observations in the class-interval 
in which the median lies. The sign of $c$ must be retained in the 
calculations. For the number of items in that portion of this 
class-interval which falls below the median, we have 

$$N_m \, \frac{\frac{i}{2} + c}{i}.$$

The average deviation of these items from the median is 

$$\frac{\frac{i}{2} + c}{2}.$$

The sum of the deviations from the median of the items in this 
segment of the class-interval containing the median is the 
product of these two quantities, or 

$$\frac{N_m}{2i}\left(\frac{i}{2} + c\right)^2.$$

For the number of items in that portion of this class-interval 
which lies above the median, we have 

$$N_m \, \frac{\frac{i}{2} - c}{i}.$$

The average deviation of these items from the median is 

$$\frac{\frac{i}{2} - c}{2}.$$

The sum of the deviations from the median of the items in this 
segment of the class-interval containing the median is given by 

$$\frac{N_m}{2i}\left(\frac{i}{2} - c\right)^2.$$

Accordingly, the total correction referred to under (3), on p. 142, 
or the sum of the deviations from the median of the items within 
the class-interval containing the median, is 

$$\frac{N_m}{2i}\left[\left(\frac{i}{2} + c\right)^2 + \left(\frac{i}{2} - c\right)^2\right].$$

The nature of these formulas may be made clearer by insertion 
of the values in the example cited above. 
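
As a hedged illustration of that insertion: on the uniform-distribution assumption, the two segment sums within the median class combine to $\frac{N_m}{2i}\left[\left(\frac{i}{2}+c\right)^2 + \left(\frac{i}{2}-c\right)^2\right]$. With $N_m = 520$, $i = 5$ and $c = 0.61$ from the example, this gives

$$\frac{520}{2 \times 5}\left[(2.5 + 0.61)^2 + (2.5 - 0.61)^2\right] = 52\,(9.6721 + 3.5721) = 52 \times 13.2442 = 688.70.$$

The figure of 688.67 obtained earlier differs in the second decimal place only because the class counts 323.4 and 196.6 were rounded before multiplying.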


In the final computation of the mean deviation we must 
apply to the sum referred to under (1), on p. 142, the two cor- 
rections noted under (2) and (3) on p. 142. From (1) we have 
56,610; the correction under (2) is -76.86; the correction 
under (3) is +688.67. The sum of the deviations from the 
median is, therefore, 57,221.81. For the mean deviation from 
the median, we have 

$$\text{M.D.} = \frac{57{,}221.81}{4{,}216} = 13.573.$$


The mean deviation from the mean may be computed by 
an identical process. 
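
The three-step procedure for grouped data can be sketched in Python as follows. This is a minimal illustration under the text's assumption of uniform distribution within each class; the function name `grouped_mean_deviation` and its argument names are invented for the example. Because it carries the median unrounded (about 48.106 rather than 48.11), the individual corrections differ slightly from the hand computation, but the resulting mean deviation agrees to three decimals.

```python
# Frequencies of Table 32; classes start at 25.0 cents and are 5.0 cents wide.
freqs = [12, 472, 700, 601, 520, 537, 397, 225, 139, 111, 43, 111, 74, 59,
         45, 51, 78, 6, 17, 1, 2, 5, 7, 1, 1, 0, 1]

def grouped_mean_deviation(freqs, lower0, width):
    """Return (median, mean deviation from the median) for grouped data."""
    n = sum(freqs)
    # Locate the class containing the median and interpolate within it.
    cum = 0
    for k, f in enumerate(freqs):
        if cum + f >= n / 2:
            break
        cum += f
    lower = lower0 + k * width
    med = lower + (n / 2 - cum) / f * width
    origin = lower + width / 2               # midpoint of the median class
    c = med - origin                         # sign is retained
    n_below, n_mid, n_above = cum, f, n - cum - f
    # Step 1: deviations from the arbitrary origin, median class excluded.
    step1 = sum(fj * abs(lower0 + (j + 0.5) * width - origin)
                for j, fj in enumerate(freqs) if j != k)
    # Step 2: correction for using the origin instead of the true median.
    step2 = (n_below - n_above) * c
    # Step 3: items inside the median class, assumed uniformly spread.
    step3 = n_mid / (2 * width) * ((width / 2 + c) ** 2 + (width / 2 - c) ** 2)
    return med, (step1 + step2 + step3) / n

med, md = grouped_mean_deviation(freqs, lower0=25.0, width=5.0)
print(round(med, 2), round(md, 3))
```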


THE STANDARD DEVIATION 

The process of calculating the mean deviation is alge- 
braically illogical because algebraic signs are disregarded. 
In the computation of the standard deviation this error is 
avoided and a measure of more precise mathematical sig- 
nificance is secured. The conventional symbol for the 
standard deviation is the Greek letter sigma, $\sigma$. 

In computing this measure the deviations of the indi- 
vidual items from the arithmetic mean are squared, totaled, 
the mean of the squared deviations obtained, and the square 
root of this mean extracted. The standard deviation is, 
thus, the square root of the mean of the squared deviations. 
This measure is also termed the root-mean-square deviation, 
a useful name because it describes in full the method of 
calculation. The deviations are always measured from the 
arithmetic mean, as the value of the measure is a minimum 
under these conditions. A simple example will illustrate the 
process (Table 33). 


Table 33 

Computation of Standard Deviation 

   m      f      d      d² 
   3      1     -6      36 
   6      1     -3       9 
   9      1      0       0 
  12      1     +3       9 
  15      1     +6      36 
         ---           --- 
          5             90 

M = 9 

$$\sigma = \sqrt{\frac{90}{5}} = \sqrt{18} = 4.24$$

When the standard deviation is computed from ungrouped 
data, as in this case, the formula¹ is 

$$\sigma = \sqrt{\frac{\sum d^2}{N}}.$$


When the items are grouped in a frequency distribution 
the task of computation is a little more complicated. The 
measurement of deviations from an arbitrary origin is essen- 
tial in this case, as it greatly simplifies the calculations. 

¹ This formula is used in statistical description, which is the concern of this 
section of the book. If our purpose is to use results secured from a sample as 
estimates of the attributes of the population from which the sample has been 
drawn, a slight modification is desirable. It has been shown that the estimate 
of the true standard deviation is improved if $N - 1$ be used as the divisor in the 
formula, in place of $N$. The difference is slight for estimates based on large 
samples, important for small ones. 

The general formula for the standard deviation is 

$$\sigma = \sqrt{\frac{\sum f d^2}{N}}$$

where $f$ represents the class-frequencies, $d$ the deviations 
from the arithmetic mean and $N$ the number of cases 
included. It follows that 

$$\sigma^2 = \frac{\sum f d^2}{N}.$$

If a deviation from an arbitrary origin be represented by 
$d'$ and the root-mean-square deviation from this origin be 
represented by $s_a$, we have 

$$s_a^2 = \frac{\sum f (d')^2}{N}.$$

The root-mean-square deviation from the mean ($\sigma$) is less 
than the root-mean-square deviation from any other point 
on the scale. Hence $s_a^2$ is greater than $\sigma^2$. We may repre- 
sent by $c$ the difference between the true mean and the 
arbitrary origin. It may be readily established¹ that 

$$\sigma^2 = s_a^2 - c^2.$$

The value of the standard deviation may be most easily 
determined, therefore, by computing $s_a^2$ and $c^2$. The opera- 
tions involved are illustrated in detail in Table 34, showing 
the distribution of 11,404 steel workers, classified on the 
basis of average hourly earnings in 1933. 
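
The relation between the root-mean-square deviation from an arbitrary origin and the standard deviation can be checked numerically. The sketch below uses the five items of Table 33; the choice of 6 as arbitrary origin and the helper name `rms_deviation` are assumptions made for the illustration. The last line shows the variant with divisor N - 1 used when estimating from a sample.

```python
from math import isclose, sqrt

items = [3, 6, 9, 12, 15]                 # the five items of Table 33
n = len(items)
m = sum(items) / n                        # arithmetic mean, 9.0

def rms_deviation(values, origin):
    """Root-mean-square deviation of `values` from `origin`."""
    return sqrt(sum((v - origin) ** 2 for v in values) / len(values))

sigma = rms_deviation(items, m)           # standard deviation
origin = 6                                # arbitrary origin (illustrative)
s_a = rms_deviation(items, origin)
c = m - origin                            # true mean minus arbitrary origin

print(round(sigma, 2))                    # 4.24
print(isclose(sigma ** 2, s_a ** 2 - c ** 2))   # True
print(s_a > sigma)                        # True: the mean gives the minimum

# Divisor N - 1 in place of N, for estimates based on a sample.
sample_sd = sqrt(sum((v - m) ** 2 for v in items) / (n - 1))
```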


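¹ For $d' = d + c$, 

$$(d')^2 = d^2 + 2cd + c^2$$
$$\sum (d')^2 = \sum d^2 + 2c \sum d + Nc^2$$

but $\sum d = 0$, 

$$\therefore\ \sum (d')^2 = \sum d^2 + Nc^2$$
$$\frac{\sum (d')^2}{N} = \frac{\sum d^2}{N} + c^2$$
$$s_a^2 = \sigma^2 + c^2$$
$$\sigma^2 = s_a^2 - c^2.$$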



Table 34. Computation of Standard Deviation 
Average Hourly Earnings of Workers in Open-Hearth Furnaces in 1933 

    (1)          (2)       (3)       (4)        (5)        (6)        (7)        (8) 
Class-interval  Midpoint  Frequency  Deviation 
(cents per      (cents)              from 
hour)                                arbitrary 
                                     origin 
                   m         f         d'        fd'      f(d')²    (d'+1)²   f(d'+1)² 

  15.0- 19.9      17.5       41       - 9     -   369      3,321       64       2,624 
  20.0- 24.9      22.5       54       - 8     -   432      3,456       49       2,646 
  25.0- 29.9      27.5      342       - 7     - 2,394     16,758       36      12,312 
  30.0- 34.9      32.5    1,158       - 6     - 6,948     41,688       25      28,950 
  35.0- 39.9      37.5    2,103       - 5     -10,515     52,575       16      33,648 
  40.0- 44.9      42.5    2,063       - 4     - 8,252     33,008        9      18,567 
  45.0- 49.9      47.5    1,433       - 3     - 4,299     12,897        4       5,732 
  50.0- 54.9      52.5    1,131       - 2     - 2,262      4,524        1       1,131 
  55.0- 59.9      57.5      775       - 1     -   775        775        0           0 
  60.0- 64.9      62.5      478         0           0          0        1         478 
  65.0- 69.9      67.5      457         1         457        457        4       1,828 
  70.0- 74.9      72.5      304         2         608      1,216        9       2,736 
  75.0- 79.9      77.5      216         3         648      1,944       16       3,456 
  80.0- 84.9      82.5      193         4         772      3,088       25       4,825 
  85.0- 89.9      87.5      117         5         585      2,925       36       4,212 
  90.0- 94.9      92.5      111         6         666      3,996       49       5,439 
  95.0- 99.9      97.5       62         7         434      3,038       64       3,968 
 100.0-104.9     102.5       71         8         568      4,544       81       5,751 
 105.0-109.9     107.5      103         9         927      8,343      100      10,300 
 110.0-114.9     112.5       34        10         340      3,400      121       4,114 
 115.0-119.9     117.5       58        11         638      7,018      144       8,352 
 120.0-124.9     122.5       27        12         324      3,888      169       4,563 
 125.0-129.9     127.5       19        13         247      3,211      196       3,724 
 130.0-134.9     132.5       19        14         266      3,724      225       4,275 
 135.0-139.9     137.5       14        15         210      3,150      256       3,584 
 140.0-144.9     142.5       12        16         192      3,072      289       3,468 
 145.0-149.9     147.5        2        17          34        578      324         648 
 150.0-154.9     152.5        4        18          72      1,296      361       1,444 
 155.0-159.9     157.5        2        19          38        722      400         800 
 160.0-164.9     162.5        1        20          20        400      441         441 

 Total                   11,404               -28,200    229,012               184,016 

$N$ = 11,404 
Class-interval = 5.0 cents 
$c$ (in class-interval units) $= \frac{\sum fd'}{N} = \frac{-28{,}200}{11{,}404} = -2.4728$ 
$c^2$ (in class-interval units) $= +6.1147$ 
$s_a^2$ (in class-interval units) $= \frac{\sum f(d')^2}{N} = \frac{229{,}012}{11{,}404} = 20.0817$ 
$\sigma^2$ (in class-interval units) $= s_a^2 - c^2 = 20.0817 - 6.1147 = 13.9670$ 
$\sigma$ (in class-interval units) $= 3.737$ 
$\sigma$ (in original units) $= 3.737 \times 5.0$ cents $= 18.685$ cents. 

The entire calculation, it will be noted, is carried through 
in terms of class-interval units, the result being reduced to 
the original units in the final operation. In computing c, 
the difference between the true mean and the arbitrary 
origin, the algebraic sum of the deviations is divided by the 
number of cases. The arithmetic mean could be deter- 
mined by reducing c to original units and adding this value 
(algebraically) to the value of the arbitrary quantity se- 
lected as origin, but this is not an essential step. The actual 
value of the mean need not be known in the computation of 
the standard deviation. 

A check upon the accuracy of the calculations (the Charlier 
check¹) is afforded by the figures in cols. (7) and (8) of 
Table 34. If deviations be measured, not from the arbitrary 
origin employed in computing the standard deviation, but 
from an origin one class-interval below, we secure a set of 
values equal to d' + 1. The squares of these values are 
given in col. (7). Multiplying by the corresponding fre- 
quencies we have the quantities recorded in col. (8), the 
sum of which is 184,016. This total stands in a definite 
relationship to the values secured in computing the standard 
deviation. For 

$$\sum f(d' + 1)^2 = \sum f[(d')^2 + 2d' + 1]$$
$$= \sum f(d')^2 + 2\sum fd' + \sum f$$

or 

$$\sum f(d' + 1)^2 = \sum f(d')^2 + 2\sum fd' + N.$$

Inserting in this last equation the values secured from 
the calculations shown in Table 34, we obtain this check: 

$$184{,}016 = 229{,}012 + 2(-28{,}200) + 11{,}404 = 184{,}016.$$
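
The Charlier check lends itself to a short mechanical verification. The sketch below recomputes the column totals of Table 34 from the class frequencies and confirms the identity; the variable names are illustrative only.

```python
# Frequencies of Table 34; d' runs from -9 to +20 in class-interval units.
freqs = [41, 54, 342, 1158, 2103, 2063, 1433, 1131, 775, 478, 457, 304, 216,
         193, 117, 111, 62, 71, 103, 34, 58, 27, 19, 19, 14, 12, 2, 4, 2, 1]
devs = list(range(-9, 21))

n = sum(freqs)                                               # N
sum_fd = sum(f * d for f, d in zip(freqs, devs))             # col. (5)
sum_fd2 = sum(f * d * d for f, d in zip(freqs, devs))        # col. (6)
sum_shift = sum(f * (d + 1) ** 2 for f, d in zip(freqs, devs))  # col. (8)

print(n, sum_fd, sum_fd2, sum_shift)    # 11404 -28200 229012 184016
check_ok = sum_shift == sum_fd2 + 2 * sum_fd + n
print(check_ok)                         # True
```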

The following is a summary of the steps in the process of 
computing the standard deviation of items grouped in a 
frequency distribution: 

¹ Cf. C. V. L. Charlier, Vorlesungen über die Grundzüge der Mathematischen 
Statistik, Lund, Verlag Scientia, 1920, 19. 

1. Select as arbitrary origin the mid-point of a class near the center 
   of the distribution. 

2. Measure the deviations from this point of the items in each class, 
   in class-interval units. Multiply the deviations by the corre- 
   sponding class-frequencies. 

3. Divide the algebraic sum of the deviations by N. This gives c, 
   in class-interval units. Compute $c^2$. 

4. Square the deviations and multiply by the corresponding class- 
   frequencies. 

5. Divide the sum of the squared deviations by N. This gives $s_a^2$, 
   in class-interval units. 

6. From the formula, $\sigma^2 = s_a^2 - c^2$, compute $\sigma^2$. Extract the 
   square root of this value, securing $\sigma$ in class-interval units. 

7. Multiply $\sigma$, as thus computed, by the class-interval. The result 
   is $\sigma$ in the original units of measurement. 
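
The seven steps can be sketched in Python as below. The function name `grouped_sd` and its parameters are assumptions made for this illustration; the data are the frequencies of Table 34, with the class whose midpoint is 62.5 cents (index 9) taken as arbitrary origin. Carried at full precision, the result agrees with the hand computation (18.685 cents) to within the rounding of intermediate figures.

```python
from math import sqrt

def grouped_sd(freqs, first_mid, width, origin_index):
    """Mean and standard deviation of grouped data, arbitrary-origin method."""
    n = sum(freqs)
    # Steps 1-2: deviations from the chosen midpoint, in class-interval units.
    devs = [j - origin_index for j in range(len(freqs))]
    # Step 3: c, the mean minus the arbitrary origin, in class-interval units.
    c = sum(f * d for f, d in zip(freqs, devs)) / n
    # Steps 4-5: mean squared deviation from the arbitrary origin.
    s_a2 = sum(f * d * d for f, d in zip(freqs, devs)) / n
    # Step 6: standard deviation in class-interval units.
    sigma_units = sqrt(s_a2 - c * c)
    # Step 7: reduce to original units; the mean is a by-product.
    origin = first_mid + origin_index * width
    return origin + c * width, sigma_units * width

freqs = [41, 54, 342, 1158, 2103, 2063, 1433, 1131, 775, 478, 457, 304, 216,
         193, 117, 111, 62, 71, 103, 34, 58, 27, 19, 19, 14, 12, 2, 4, 2, 1]
mean, sigma = grouped_sd(freqs, first_mid=17.5, width=5.0, origin_index=9)
print(round(mean, 3), round(sigma, 2))
```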

Certain of the characteristics of the standard deviation 
and its relation to other measures of dispersion are described 
in a later section.¹ 

THE QUARTILE DEVIATION 

In the chapter on averages methods of locating the quar- 
tiles and deciles were described. The former are those points 
on the scale of values, along which the items of a given 
distribution lie, which divide the total number of items into 
four equal groups. The deciles are those points dividing the 
total number of items into ten equal groups. The degree 
and character of the variation in a frequency distribution 
may be accurately described if the location of the quartiles 
and deciles is shown. Such knowledge, however, while 
helpful in giving a picture of the distribution, is not as use- 
ful for purposes of concise description and comparison as 
knowledge of the values of the mean deviation or the stand- 
ard deviation. The significance of a single measure is more 
readily grasped than is the meaning of a number of inter- 
related values. Such a measure of variation may be com- 
puted from the quartiles, however. With regard to ease of 
calculation and immediate significance this quartile deviation 
has distinct merits. 

¹ A correction to be applied to the standard deviation in certain cases 
(Sheppard’s correction) is described in Chapter XIII. 

Within the range between the two quartiles, of course, 
one half of all the measures are included. The greater the 
concentration the smaller this interval, hence a fairly accu- 
rate measure of dispersion may be obtained from the rela- 
tionship between these two quartiles. The quartile deviation 
is the semi-interquartile range, half the distance along the 
scale between the first and third quartiles. Thus if Q.D. 
represent the quartile deviation, $Q_1$ the first quartile and 
$Q_3$ the third quartile, 

$$\text{Q.D.} = \frac{Q_3 - Q_1}{2}.$$

If the value of a point on the scale half-way between 
the first and third quartiles is represented by K, one half 
of all the measures in a frequency distribution will fall 
within the range K ± Q.D. For the data in Table 32, 
relating to the hourly earnings in 1933 of steel workers in 
the Great Lakes and Middle West District, we have 

$$Q_1 = 39.07$$
$$Q_3 = 59.03$$
$$\text{Q.D.} = \frac{59.03 - 39.07}{2} = 9.98$$
$$K = 39.07 + 9.98 = 49.05.$$


Thus one half of all the measures lie within the range 
49.05 ± 9.98. This statement, together with a statement 
of the average hourly earnings in 1933 (mean, median, or 
mode), constitutes a useful description of the distribution. 
In a perfectly symmetrical distribution the value of K will 
coincide with the value of the median (that is, the median 
will lie half-way along the scale from $Q_1$ to $Q_3$). The dis- 
tribution of wage rates is slightly asymmetrical, the value of 
the median being 48.11, as compared with the value of 
49.05 for K. 
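
The quartile figures quoted for Table 32 can be reproduced by linear interpolation within the class containing each quartile. The sketch below assumes uniform distribution within classes, as in the location of the median; the helper `grouped_quantile` is an illustrative name, not a method named in the text.

```python
def grouped_quantile(freqs, lower0, width, p):
    """Value below which a fraction p of the items lie (interpolated)."""
    target = p * sum(freqs)
    cum = 0
    for k, f in enumerate(freqs):
        if f > 0 and cum + f >= target:
            return lower0 + k * width + (target - cum) / f * width
        cum += f
    raise ValueError("p must lie between 0 and 1")

# Frequencies of Table 32; classes start at 25.0 cents, width 5.0 cents.
freqs = [12, 472, 700, 601, 520, 537, 397, 225, 139, 111, 43, 111, 74, 59,
         45, 51, 78, 6, 17, 1, 2, 5, 7, 1, 1, 0, 1]
q1 = grouped_quantile(freqs, 25.0, 5.0, 0.25)
q3 = grouped_quantile(freqs, 25.0, 5.0, 0.75)
qd = (q3 - q1) / 2                      # quartile deviation
k_mid = q1 + qd                         # point half-way between the quartiles
print(round(q1, 2), round(q3, 2), round(qd, 2), round(k_mid, 2))
# 39.07 59.03 9.98 49.05
```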


THE PROBABLE ERROR 

In studying the results of astronomical and other physi- 
cal measurements it has been found that the values secured 
by different observers for the same constant quantity vary. 
These varying results, however, are distributed in a certain 
definite way, and when plotted give a curve similar to the 
normal curve of error. In such cases there is an immediate 
and obvious need of some measure of variation which may 
be used as an index of the reliability of given results. If 
the results secured by different investigators, or by the 
same investigator at different times, vary widely they can- 
not be accepted as reliable, while the reverse is true if the 
variation is slight. The measure of dispersion which has 
been generally employed in such cases is termed the prob- 
able error. The probable error is that amount which, in a 
given case, is exceeded by the errors of one half the ob- 
servations. Since the most probable value of a given series 
of observations is their arithmetic mean, the probable error 
is always measured from the mean. The name of this 
measure derives from the fact that the probability that a 
given observation will vary from the mean of all the ob- 
servations by an amount greater than the probable error 
is exactly ½. It follows that, when the observations are 
arranged in the form of a frequency distribution, a distance 
equal to the probable error laid off on each side of the arith- 
metic mean will define limits within which one half of the 
total number of cases will fall. 

This measure of variation has been employed in fields 
other than that in which it was originally applied, fields in 
which the name probable error is somewhat misleading. In 
such cases it is perhaps better to think of it as the probable 
deviation, that distance from the mean which will be ex- 
ceeded by one half of the total deviations. 

The probable error is a measure of dispersion which is 
fully significant only when it applies to a distribution fol- 
lowing the normal law of error. In such cases it has a 
definite and precise meaning. This is not so when it is 
applied to skew distributions, and its use in such cases 
is not advisable. The quartile deviation, the value of which 
is equal to that of the probable error in a normal distribu- 
tion, has a more direct significance than the probable error 
in the description of abnormal distributions, and should be 
employed in such cases. In a later section the use of the 
probable error as a measure of the reliability of statistical 
results is more fully explained. 

The value of the probable error in a given case, assuming 
a normal distribution to prevail, may be determined from 
the value of the standard deviation, for there is a constant 
relationship between these two. This is expressed by the 
formula: P.E. = 0.6745$\sigma$. 
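
The constant 0.6745 is the point on the standard normal scale below which three quarters of the cases lie, so that one half fall within that distance on either side of the mean. A minimal check with Python's standard library follows; the value of `sigma` is a hypothetical figure chosen only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()               # mean 0, standard deviation 1
z = std_normal.inv_cdf(0.75)            # upper quartile of the normal curve
print(round(z, 4))                      # 0.6745

sigma = 2.0                             # hypothetical standard deviation
probable_error = z * sigma
# Share of a normal distribution within one probable error of the mean.
share = std_normal.cdf(z) - std_normal.cdf(-z)
print(round(share, 4))                  # 0.5
```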


Relations between Different Measures of 
Variation 

An understanding of the significance of the various meas- 
ures of dispersion described above may be facilitated by a 
general comparison and a summary statement of the rela- 
tions holding between them. 

1. The range is a distance along the scale within which all the 
   observations lie. 

2. The quartile deviation or semi-interquartile range is a distance 
   along the scale which, when laid off on each side of the point 
   midway between the two quartiles, includes one half the total 
   number of observations. 

3. The mean deviation from the mean, in a normal or slightly 
   skew distribution, is equal to about 4/5 of the standard devia- 
   tion. A range of 7½ times the mean deviation, centering at the 
   mean, will include approximately 99 per cent of all the cases. 

4. When a distance equal to the standard deviation is laid off on 
   each side of the mean, in a normal or only slightly skew dis- 
   tribution, about two thirds of all the cases will be included. 
   (In the normal distribution exactly 68.26 per cent of the obser- 
   vations will be included.) When a distance equal to twice the 
   standard deviation is laid off on each side of the mean approxi- 
   mately 95 per cent of the cases will be included (exactly 95.46 
   per cent in a normal distribution). When a distance equal to 
   three times the standard deviation is laid off on each side of 
   the mean about 99 per cent of all the observations will be 
   included (exactly 99.73 per cent in a normal distribution). 
   This general rule that a range of six times the standard devia- 
   tion, centering at the mean, will include about 99 per cent of 
   all the measures furnishes a useful check upon calculations. 

   A study of Fig. 45 may help to make clear the significance of 
   the standard deviation in a normal distribution. 

5. The probable error, in a normal distribution, is equal to 0.6745$\sigma$. 
   A range of twice the probable error, centering at the mean, 
   will include 50 per cent of all the observations. A range of 
   eight times the probable error, centering at the mean, will 
   include approximately 99 per cent of all the observations. 
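
For the normal distribution these relations can be verified numerically. The sketch below uses the standard library's `NormalDist`; the helper `coverage` is an illustrative name. Modern tables give 68.27 and 95.45 per cent for one and two standard deviations, differing in the last digit from the rounded figures quoted above. The ratio of the mean deviation to the standard deviation in a normal distribution is the square root of 2/π, or about 0.80.

```python
from math import pi, sqrt
from statistics import NormalDist

std_normal = NormalDist()

def coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * coverage(k), 2))     # 68.27, 95.45, 99.73

md_over_sigma = sqrt(2 / pi)                  # mean deviation / sigma
pe_over_sigma = std_normal.inv_cdf(0.75)      # probable error / sigma
print(round(md_over_sigma, 4))                # 0.7979

# A range of 7.5 mean deviations, or of 8 probable errors, centred at the mean.
cover_md = coverage(3.75 * md_over_sigma)
cover_pe = coverage(4 * pe_over_sigma)
print(cover_md > 0.99, cover_pe > 0.99)       # True True
```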

Characteristic Features of the Chief Measures 
of Variation 

The range 

1. The range is easily calculated and its significance is readily 
   understood. As a rough measure of the degree of variation 
   the range is useful. 

2. The value of the range is determined by the values of the two 
   extreme cases. It is thus a highly unstable measure, the 
   value of which may be greatly changed by the addition or 
   withdrawal of a single figure. 

3. This measure gives no indication of the character of the distri- 
   bution within the two extreme observations. 

The quartile deviation 

1. The quartile deviation is a measure of dispersion that is easily 
   computed and readily understood. It is superior to the range 
   as a rough measure of variation. 

2. The quartile deviation is not a measure of the variation from 
   any specific average. 

3. This measure is not affected by the distribution of the items 
   between the first and third quartiles, or by the distribution 
   outside the quartiles. The values of the quartile deviation 
   might be the same for two quite dissimilar distributions, pro- 
   vided the quartiles happened to coincide. Because it is not 
   affected by the deviations of individual items it cannot be 
   accepted as an accurate measure of variation. 

4. The quartile deviation is not suited to algebraic treatment. 

The mean deviation 

1. The mean deviation is affected by the value of every observa- 
   tion. As the average difference between the individual items 
   and the median (or mean) of the distribution it has a precise 
   significance. 

2. The mean deviation is less affected by extreme deviations than 
   the standard deviation. 

3. Mathematically, the mean deviation is not as logical or as con- 
   venient a measure of dispersion as the standard deviation. 

The standard deviation 

1. The standard deviation is affected by the value of every ob- 
   servation. 

2. The process of squaring the deviations before adding avoids the 
   algebraic fallacy of disregarding signs. 

3. The standard deviation has a definite mathematical meaning 
   and is perfectly adapted to algebraic treatment. 

4. The standard deviation is, in general, less affected by fluctua- 
   tions of sampling than the other measures of dispersion. 

5. The normal curve of error has been analyzed in terms of the 
   standard deviation. The information thus obtained has 
   increased greatly the utility of the standard deviation. 

The probable error 

1. The probable error has a definite meaning in the case of a dis- 
   tribution following the normal law. It has not this precise 
   meaning for other distributions, and should not be employed 
   in describing them. 

2. For distributions to which it is adapted, the probable error is an 
   extremely useful measure. Its most important use is as an 
   index of the magnitude of errors of sampling. 

3. The definite relationship between the probable error and the 
   standard deviation, for a normal distribution, permits the 
   value of the probable error to be readily determined. 

All the measures of variation described above may be 
utilized for particular purposes. The standard deviation, 
however, is the best general measure and should be em- 
ployed in all cases where a high degree of accuracy is re- 
quired. The probable error is, in effect, merely a fractional 
part of the standard deviation, with a definite but restricted 
field of usefulness. 

The Measurement of Relative Variation 

We have been dealing in the preceding section with 
absolute variability. The various measures of dispersion 
secured by the methods outlined describe the variability 
of the data in terms of absolute units of measurement. 
The standard deviation of London-Paris exchange rates is 
in francs, the standard deviation of pig iron production in 
tons, etc. If the object in a given case is the description of 
a single frequency distribution it is desirable that the orig- 
inal unit be employed throughout, but if measures of varia- 
tion of two different distributions are to be compared, diffi- 
culties are encountered. This is clear if the units are unlike, 
but even if the units are identical the same difficulty arises. 
Thus measures of variation in the weights of dogs and in the 
weights of horses might both have been computed in pounds. 
Because the standard deviation of horse weights is greater 
than the standard deviation of dog weights, it does not fol- 
low that the degree of variability is greater in the former 
case. A measure of absolute variation is significant only in 
relation to the average from which the deviations are meas- 
ured. Its use, apart from this average, is meaningless. For 
comparison, therefore, it must be reduced to a relative form, 
and the obvious procedure is to express a given measure 
of variation as a percentage of the average from which the 
deviations have been measured. The quantity thus becomes 
an abstract number, a measure of the relative variability 
of the given observations, and may be compared with similar 
terms computed from other distributions. 

THE COEFFICIENT OF VARIATION 

The measure of relative variation most commonly em- 
ployed is that developed by Pearson, termed the coefficient 
of variation, and represented by the letter V. It is simply 
the standard deviation as a percentage of the arithmetic 
mean. Thus 

$$V = \frac{\sigma}{M} \times 100.$$


Applying this formula to the results secured from the an- 
alysis of the distribution of steel workers, classified accord- 
ing to hourly earnings in 1933 (Table 34), we have 

$$V = \frac{18.685}{50.136} \times 100 = 37.27\%.$$


This measurement may be compared with a similar coeffi- 
cient relating to the distribution of workers in open-hearth 
furnaces, classified according to average hourly earnings in 
1935. In that year the mean wage was 71.946 cents and the 
standard deviation 28.55 cents. From these 

$$V = \frac{28.55}{71.946} \times 100 = 39.68\%.$$


Variation in hourly earnings among steel workers was 
greater in 1935 than in 1933. The difference was not as 
great, however, as a comparison of standard deviations would 
indicate. The average wage advanced appreciably between 
1933 and 1935 and the relative variation increased only 
moderately. 
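
The comparison can be reproduced in a few lines. The function name `coefficient_of_variation` is illustrative; the figures are those quoted above for 1933 and 1935. The last two lines contrast the ratio of the standard deviations with the ratio of the coefficients, which is the point made in the preceding paragraph.

```python
def coefficient_of_variation(sigma, mean):
    """Standard deviation expressed as a percentage of the arithmetic mean."""
    return sigma / mean * 100

v_1933 = coefficient_of_variation(18.685, 50.136)
v_1935 = coefficient_of_variation(28.55, 71.946)
print(round(v_1933, 2), round(v_1935, 2))     # 37.27 39.68

sd_ratio = 28.55 / 18.685                     # absolute variation, 1935 vs 1933
v_ratio = v_1935 / v_1933                     # relative variation, 1935 vs 1933
print(sd_ratio > v_ratio)                     # True
```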

An index of variability similar to this coefficient might 
be secured by expressing any of the other measures of 
deviation as a percentage of the average from which the 
deviations were computed. Pearson’s coefficient has been 
generally adopted, however, and is the only one in wide use. 


Measures of Skewness 

Methods have been developed in the preceding sections
for describing the central tendency of a frequency distribution
and for measuring the degree of concentration or
lack of concentration about that central tendency. One
further measure is needed, and that is one which indicates 
the degree of skewness or asymmetry of a given distri- 
bution. For it is essential to know, in regard to a given 
distribution, whether the observations are arranged sym- 
metrically about the central value, or are dispersed in an 
uneven, asymmetrical fashion about that value. Having 
such a figure it will be possible effectively to summarize 
the characteristics of a frequency distribution in three sim- 
ple terms — an average, a measure of dispersion and a 
measure of skewness. There are two measures of skewness 
in current use. 

If a frequency curve is perfectly symmetrical, mean,
median, and mode will coincide. As the distribution departs
from symmetry these three values are pulled apart,
the difference between the mean and the mode being greatest.
This difference may be used, therefore, as a measure
of skewness. It is desirable in this case, as in measuring
relative variability, to secure an index in the form of an
abstract number, which may be compared with similar figures
derived from other distributions. To this end, Pearson
has proposed dividing the absolute difference between mean
and mode by the standard deviation of the given distribution.
His formula is

$$ sk \ (\text{skewness}) = \frac{M - Mo}{\sigma} $$

In a symmetrical distribution, where mean and mode coin- 
cide, the value of this measure will be zero. Under other 
conditions the value may be positive or negative, depending 
upon the relative positions of the two averages on the scale. 

For moderately skew distributions the degree of skew- 
ness may be computed more readily from the formula 

$$ sk = \frac{3(M - Md)}{\sigma} $$

This corresponds approximately to the other formula, because
of the fact that in a moderately asymmetrical distribution
the median lies between the mean and the mode,
about one third of the distance from the former towards
the latter.
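
A minimal Python sketch of the two Pearson measures follows. It assumes ungrouped observations and the population standard deviation (`pstdev`). The function names and the sample values are hypothetical and serve only to show the signs the measures take.

```python
from statistics import mean, median, mode, pstdev


def pearson_skewness_mode(values):
    """(mean - mode) / standard deviation."""
    return (mean(values) - mode(values)) / pstdev(values)


def pearson_skewness_median(values):
    """3 * (mean - median) / standard deviation, for moderately skew data."""
    return 3 * (mean(values) - median(values)) / pstdev(values)


# Hypothetical observations, not taken from the text.
symmetrical = [1, 2, 3, 3, 3, 4, 5]
tail_to_right = [1, 2, 2, 2, 3, 3, 4, 5, 6, 8]

print(pearson_skewness_mode(symmetrical), pearson_skewness_median(symmetrical))
print(pearson_skewness_mode(tail_to_right), pearson_skewness_median(tail_to_right))
```

For the symmetrical sample both measures are zero. For the sample with a tail to the right the mean exceeds both the median and the mode, so both measures are positive.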

Because it is difficult to locate the mode by simple meth- 
ods, a measure of skewness more easily computed than 
Pearson’s is desirable in some cases. Bowley has proposed 
such a method, based upon the relationship between the 
first and third quartiles and the median. If the distribution 
is symmetrical these two quartiles will be equidistant from 
the median; with an asymmetrical distribution this is not 
so. Therefore, if we let $q_2$ represent the difference between
the upper quartile and the median and $q_1$ represent the
difference between the median and the lower quartile, we
may use the formula

$$ sk = \frac{q_2 - q_1}{q_2 + q_1} $$

as a means of securing a measure of skewness. This value
will vary between 0 and ± 1. For with perfect symmetry
$q_2 = q_1$, and the measure is 0; with asymmetry so pronounced
that the median and one of the quartiles coincide,
either $q_2$ or $q_1$ becomes equal to 0, and the formula gives
a value of + 1 or − 1. Bowley suggests that a value of .1
indicates a moderate degree of skewness, while a value of
.3 indicates marked skewness.

The values secured from this measure are not, of course, 
comparable with the values secured from the application 
of Pearson’s formula for measuring skewness. 
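
Bowley's quartile measure can be sketched in Python as follows. The function takes the two quartiles and the median directly, so no particular method of locating the quartiles is assumed. The function name and the sample figures are hypothetical.

```python
def bowley_skewness(lower_quartile, median, upper_quartile):
    """Quartile measure of skewness: (q2 - q1) / (q2 + q1)."""
    upper_gap = upper_quartile - median  # q2 in the text
    lower_gap = median - lower_quartile  # q1 in the text
    if upper_gap + lower_gap == 0:
        raise ValueError("the two quartiles coincide")
    return (upper_gap - lower_gap) / (upper_gap + lower_gap)


print(bowley_skewness(10, 20, 30))  # quartiles equidistant from the median
print(bowley_skewness(10, 10, 30))  # median coincides with the lower quartile
print(bowley_skewness(10, 30, 30))  # median coincides with the upper quartile
```

The three calls illustrate the limits described above: zero for perfect symmetry, and + 1 or − 1 when the median coincides with one of the quartiles.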

KURTOSIS

Reference has been made to a fourth measurable char- 
acteristic of frequency curves. This is the degree of flat- 
toppedness, as compared with the normal curve. A measure 
of kurtosis, the technical term for this characteristic, is 
given in Chapter XIII.

CHAPTER VI


INDEX NUMBERS OF PRICES 
The Nature of Index Numbers

The term “index number” has been applied to a number 
of somewhat similar devices employed in the analysis of 
statistical series. Index numbers have been most widely 
used in the study of price changes, but a brief considera- 
tion of certain other uses may make clear the essential 
characteristics of such measures. In its simplest form this 
name is applied to a term in a time series expressed as a 
relative number. Thus an index number of cotton consumption
in the United States might take the following form:

Table 35

Domestic Cotton Consumption in the United States, 1926-1936
(Consumption in year ended July 31, 1926 = 100)

Year ended     Cotton consumption       Cotton consumption
July 31        (unit: one thousand      relative
               running bales)

1926                 6,456                    100
1927                 7,190                    111
1928                 6,834                    106
1929                 7,091                    110
1930                 6,106                     95
1931                 5,263                     82
1932                 4,866                     75
1933                 6,137                     95
1934                 5,700                     88
1935                 5,361                     83
1936                 6,351                     98


Similarly the price of a commodity may be expressed as
a relative, the price at a given date or for a given period
serving as base.




Table 36

Average Price of No. 1 Northern Spring Wheat, Minneapolis,
1913, 1929-1936
(Average price in year ended June 30, 1913 = 100)

Calendar     Weighted average      Relative
year         price per bushel      price

1913              $0.874             100
1929               1.276             146
1930               0.984             113
1931               0.739              85
1932               0.605              69
1933               0.770              88
1934               1.026             117
1935               1.166             133
1936               1.247             143


The representation of the terms in a time series as rela- 
tives, with reference to a fixed base, makes possible a ready 
comparison of the values for different dates and enables one 
to follow the trend of the series much more easily than 
when the data are presented in their original form. Compari- 
son of the trends of different series is also facilitated. 
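
The reduction of a series to relatives can be sketched in a few lines of Python. The cotton consumption figures are those of Table 35; the function name `to_relatives` is an illustrative choice, not the author's.

```python
def to_relatives(series, base_key):
    """Express every term of a series as a percentage of the base term."""
    base = series[base_key]
    return {key: round(value / base * 100) for key, value in series.items()}


# Cotton consumption, thousands of running bales, years ended July 31.
cotton = {
    1926: 6456, 1927: 7190, 1928: 6834, 1929: 7091, 1930: 6106, 1931: 5263,
    1932: 4866, 1933: 6137, 1934: 5700, 1935: 5361, 1936: 6351,
}

cotton_relatives = to_relatives(cotton, base_key=1926)
for year, relative in cotton_relatives.items():
    print(year, relative)
```

Changing `base_key` shifts the whole series to another base year without altering the ratios between any two of its terms.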

Though the term index number has been applied to such 
relatives it is better practice to reserve the term for figures 
which represent the combination of a number of series. 
The series to be combined may relate to prices, production, 
consumption, wages, volume of trade, or to any factor sub- 
ject to temporal variation. (Index numbers have been used 
also in measuring such geographical differences as arise 
from variations in living costs from city to city or from 
country to country.) Quite complex problems may be in- 
volved in the construction of any one of these special forms 
of index numbers, but the essential aim in all cases is to 
secure a single, simple series that will define the net resultants 
of the changes occurring in the constituent elements. 

A simple index number may be constructed to represent 
the course of coal and petroleum production in the United 
States. In the making of such an index it is necessary to
combine in some way production figures for bituminous
and anthracite coal and petroleum. The production figures 
and the corresponding relatives for the three series, from 
1922 to 1936, are given in Table 37. 


Table 37

Production of Bituminous and Anthracite Coal and Petroleum
in the United States, 1922-1936
(Production in 1922 = 100)

        Prod. of             Prod. of              Prod. of
Year    bit. coal    Rel.    anthr. coal    Rel.   petrol.      Rel.
        (million             (million              (million
        sh. tons)            sh. tons)             bbls.)

1922      422.3      100        54.7        100       557.5     100
1923      564.6      134        93.3        171       732.4     131
1924      483.7      115        87.9        161       713.9     128
1925      520.1      123        61.8        113       763.7     137
1926      573.4      136        84.4        154       770.9     138
1927      517.8      123        80.1        146       901.1     162
1928      500.7      119        75.3        138       901.6     162
1929      535.0      127        73.8        135     1,007.3     181
1930      467.5      111        69.4        127       898.0     161
1931      382.1       90        59.6        109       851.1     153
1932      309.7       73        49.9         91       785.2     141
1933      333.6       79        49.5         90       905.7     162
1934      359.4       85        57.2        105       908.1     163
1935      372.4       88        52.2         95       996.6     179
1936      434.1      103        54.8        100     1,098.5     197


A rough index of fuel production, based upon these three
series, is desired. It is impossible, obviously, to add the
original figures, as the units are not the same. This difficulty
may be avoided by using the relative figures. A simple
average of the three relatives for a given year may serve
as the required index. Index numbers thus secured are
given in Table 38.
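
The simple average of relatives may be sketched in Python as follows. The relatives are those of Table 37 for four of the fifteen years; the function name is hypothetical.

```python
def simple_index(relatives):
    """Equal-weight index: the arithmetic mean of the relatives, rounded."""
    return round(sum(relatives) / len(relatives))


# (bituminous coal, anthracite coal, petroleum) relatives, 1922 = 100.
fuel_relatives = {
    1922: (100, 100, 100),
    1923: (134, 171, 131),
    1929: (127, 135, 181),
    1936: (103, 100, 197),
}

unweighted = {year: simple_index(rels) for year, rels in fuel_relatives.items()}
print(unweighted)
```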

Table 38

Index Numbers of Coal and Petroleum Production in the
United States, 1922-1936
(Production in 1922 = 100)

Year    Index        Year    Index

1922     100         1930     133
1923     145         1931     117
1924     135         1932     102
1925     124         1933     110
1926     143         1934     118
1927     144         1935     121
1928     140         1936     133
1929     148

In securing this index, by adding the three relative figures
for a given year and dividing by three, equal weight
has been given to each of the three series. Such an index of
equally weighted relatives has been termed an unweighted
index, but the term is misleading. Weights are used, the
weights in this case being equal. It is clear that this index
based upon equal weights does not reflect faithfully the
three series combined in the present instance. For the three
series are not of equal importance, as the system of equal
weights assumes. The following figures showing the wholesale
values in exchange in 1926 of bituminous coal, anthracite
coal, and crude petroleum indicate the relative importance
of the three series:¹


Mineral               Wholesale value in
                      exchange in 1926

Bituminous coal         $2,157,740,000
Anthracite coal            888,141,000
Petroleum                1,355,989,000


Roughly, these stand to one another in the relation of
5, 2, and 3, and these weights may be assigned to the series
under consideration. An index for each year may be computed,
using these weights. The example in Table 39,
showing the calculations for the years 1922 and 1923, will
illustrate the method.


¹ The figures have been compiled by the U. S. Bureau of Labor Statistics.

Table 39

Computation of Weighted Index Numbers of Coal
and Petroleum Production

                   Relative                          Relative
Mineral            production   Wt.   Wt. × Rel.     production   Wt.   Wt. × Rel.
                   1922                              1923

Bituminous coal       100        5       500            134        5       670
Anthracite coal       100        2       200            171        2       342
Petroleum             100        3       300            131        3       393

                                10     1,000                      10     1,405

Index of fuel production, 1922 = 1,000 ÷ 10 = 100
Index of fuel production, 1923 = 1,405 ÷ 10 = 141
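
The weighted computation can be sketched in Python. The weights 5, 2, and 3 and the relatives are those given in the text. The function name is hypothetical, and the rounding rule (halves rounded upward, so that 140.5 becomes 141) is an assumption made to agree with the printed result for 1923; Python's built-in `round` would give 140.

```python
def weighted_index(relatives, weights):
    """Weighted average of integer relatives, with halves rounded upward."""
    total = sum(r * w for r, w in zip(relatives, weights))
    weight_sum = sum(weights)
    # Integer arithmetic avoids the round-half-to-even behaviour of round().
    return (2 * total + weight_sum) // (2 * weight_sum)


weights = (5, 2, 3)  # bituminous coal, anthracite coal, petroleum

index_1922 = weighted_index((100, 100, 100), weights)
index_1923 = weighted_index((134, 171, 131), weights)
index_1936 = weighted_index((103, 100, 197), weights)
print(index_1922, index_1923, index_1936)
```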


The value of the index thus secured for each of the fif- 
teen years covered is shown in Table 40. 


Table 40

Weighted Index Numbers of Coal and Petroleum
Production in the United States, 1922-1936

Year    Index        Year    Index

1922     100         1930     129
1923     141         1931     113
1924     128         1932      97
1925     125         1933     106
1926     140         1934     112
1927     139         1935     117
1928     136         1936     131
1929     145


Differences between the two series of index numbers are 
to be expected. The second series, which is the more log- 
ically weighted, is, of course, the more accurate of the two, 
and gives a more faithful representation of the combined 
effect of the forces affecting the output of coal and petroleum. 

Another type of index number is one in which the items 
in the constituent series are totaled, the aggregate figure, 
instead of an average, serving as the representative of the 
entire group. Such a form of index number may be constructed
only when the different series are all expressed in
the same unit. This form is frequently employed as an 
indication of changes in the level of prices, the aggregate 
cost of a bill of goods at one period being compared with 
the aggregate cost of the same goods at other dates. The 
figures in Table 41 illustrate this type of index. 


Table 41

Bradstreet's Index of Wholesale Prices in the
United States, 1926-1937 *

Year    Index        Year    Index

1926    13.02        1932     7.10
1927    12.78        1933     7.86
1928    13.28        1934     9.22
1929    12.67        1935     9.92
1930    10.75        1936    10.10
1931     8.76        1937    11.06


Each of the yearly aggregates quoted above is the sum
of the average prices during the year of 96 commodities at 
wholesale. Before being added all the prices are reduced to 
the “per pound” basis, so that a certain degree of compara- 
bility is secured. Such an index may be readily changed to 
the relative form, any year being taken as a base and the 
totals for the other years expressed as percentages of the 
figure for the base year. 
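
As a sketch of the change to relative form, the yearly aggregates of Table 41 may be expressed as percentages of the 1926 aggregate. The choice of 1926 as base, the one-decimal rounding, and the function name are assumptions made for illustration only.

```python
def aggregates_to_relatives(aggregates, base_year):
    """Express each yearly aggregate as a percentage of the base-year figure."""
    base = aggregates[base_year]
    return {year: round(total / base * 100, 1) for year, total in aggregates.items()}


# Yearly aggregates of the index in Table 41.
bradstreet = {
    1926: 13.02, 1927: 12.78, 1928: 13.28, 1929: 12.67, 1930: 10.75, 1931: 8.76,
    1932: 7.10, 1933: 7.86, 1934: 9.22, 1935: 9.92, 1936: 10.10, 1937: 11.06,
}

bradstreet_relatives = aggregates_to_relatives(bradstreet, base_year=1926)
for year, relative in bradstreet_relatives.items():
    print(year, relative)
```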

* Construction of this index was discontinued at the end of 1937.

The examples which have been given will indicate some
of the many forms which index numbers may take. The
term may refer to a simple relative number; it may be
applied to an average of relative terms, or to an aggregate
of relative or absolute figures. In all the examples given
the index has been designed to serve as a measure of change
over a period, as an indicator of changes in the values of
time series. The term may have a much broader meaning
than this. An index of the ability of salesmen might be
constructed by giving numerical values to the factors determining
their usefulness and securing an average of these
values. An index of the efficiency of different departments
in a business enterprise might be constructed. In any case,
the construction of an index involves the reduction to comparable
terms of a number of different factors and the
replacement of these several terms by a single figure which
may serve as their representative. Comparison is thus
facilitated, whether it be comparison over time or space, or
comparison with other indices secured by averaging terms
relating to a similar unit. In all its forms (except the first
limited and exceptional meaning in which it applies to a
simple relative) an index number is thus a type of statistical
average, and such numbers, in their construction and use,
are subject to all the rules and limitations set forth in the
development of the subject of averages.

In the present work we are interested only in the applica- 
tion of the index number device to time series. So varied, 
however, are the rules and practices relating to its applica- 
tion to different types of time series that certain of these 
types must be treated separately. Our first concern is with 
index numbers of wholesale prices. 

Price Changes 

When price movements are surveyed in detail it is diffi- 
cult to perceive order, or any definite trend. We find a multiplicity
of conflicting movements. The price quotations
in Table 42, taken at random, are roughly
typical of what would be found were the entire field of prices
canvassed in order to compare price movements from month 
to month. 

Table 42

Commodity Prices at Wholesale *

                                                         Price         Price
Commodity                                  Unit          (wholesale)   (wholesale)
                                                         October,      November,
                                                         1937          1937

Brick, common building, average of
  yard prices                              1,000         $12.113       $12.113
Pig iron, basic, Valley furnace            Gross ton      23.500        23.500
Cement, Portland, average of
  plant prices                             Bbl.            1.667         1.667
Linseed oil, raw, N. Y.                    Pound            .110          .106
Steel billets, rerolling, Pitts.           Gross ton      37.000        37.000
Steel, scrap, Chicago                      Gross ton      14.688        12.500
Copper, electrol., refinery                Pound            .119          .108
Lead, pig, N. Y.                           Pound            .058          .051
Zinc, pig, N. Y.                           Pound            .065          .060
Coal, anthr., chestnut, average of 15
  price series, on tracks, destination     Net ton         9.472         9.610
Coal, bit., mine run, average of 27
  price series, on tracks, destination     Net ton         4.305         4.303
Crude petroleum, Penn., at wells           Bbl.            2.413         2.350
Gasoline, motor, California, refinery      Gal.             .083          .085
Cotton, middling, N. O.                    Pound            .083          .080
Wheat, no. 2 red winter, Chi.              Bu.             1.033          .951
Sugar, granulated, N. Y.                   Pound            .048          .048

Of the sixteen commodities listed, five showed no price
change at all between October and November, 1937, two
showed price increases, and in nine cases prices declined.
Some of the price movements were inconsiderable, while
some marked very material changes. Such, as seen here
in miniature, is what happens in the price system as a whole.
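
The tally of unchanged, rising, and falling prices among the sixteen commodities can be verified with a short Python sketch. The quotations are those of Table 42, with abbreviated commodity labels; the function name is hypothetical.

```python
# (commodity, October 1937 price, November 1937 price), from Table 42.
quotations = [
    ("Brick", 12.113, 12.113), ("Pig iron", 23.500, 23.500),
    ("Cement", 1.667, 1.667), ("Linseed oil", 0.110, 0.106),
    ("Steel billets", 37.000, 37.000), ("Steel scrap", 14.688, 12.500),
    ("Copper", 0.119, 0.108), ("Lead", 0.058, 0.051),
    ("Zinc", 0.065, 0.060), ("Anthracite coal", 9.472, 9.610),
    ("Bituminous coal", 4.305, 4.303), ("Crude petroleum", 2.413, 2.350),
    ("Gasoline", 0.083, 0.085), ("Cotton", 0.083, 0.080),
    ("Wheat", 1.033, 0.951), ("Sugar", 0.048, 0.048),
]


def tally_changes(pairs):
    """Count commodities whose price was unchanged, rose, or fell."""
    counts = {"unchanged": 0, "increase": 0, "decline": 0}
    for _, earlier, later in pairs:
        if later > earlier:
            counts["increase"] += 1
        elif later == earlier:
            counts["unchanged"] += 1
        else:
            counts["decline"] += 1
    return counts


price_tally = tally_changes(quotations)
print(price_tally)
```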

* As compiled by the U. S. Bureau of Labor Statistics.

All prices do not, with absolute uniformity, move up or
down or remain constant. Each of the thousands of commodities
traded in on the markets of any country, or of the
world, moves in its own individual way, subject to a variety
of influences. Yet it does not act in isolation. In its price
movements it affects other commodities, and is affected
by them. And, in addition to the forces peculiar to each
commodity, there are broad forces which act throughout
the price system, affecting all commodities. It is the business
of the economic statistician to bring order out of the
chaos of price movements taking place at any given time
and, out of the multiplicity of minor movements, to pick
the broad trends which affect the whole economic system.

The forces bringing about the price movements that are 
to be studied are numerous and complicated, but some 
general conclusions may be drawn with regard to them. 
There are, in the first place, all those changes in production 
and consumption conditions peculiar to individual commodities
and affecting directly the prices of those commodities.
The opening of new fields, improvements in production 
technique in individual cases, changes in fashion and the 
transfer of demand from some commodities to others, changes 
in demand and supply with the seasons — all these are 
causing constant price readjustments. These are the changes 
which in ordinary times are most obvious, which are brought 
home directly to the individual merchant or consumer. 
Such changes affect the whole price system, as has been 
pointed out, but not in general by causing upward or downward
movements in the system as a whole.

These general movements are due to forces that are 
broader in their scope. The general improvement in pro- 
duction technique and the increase in the productivity of 
human labor which has resulted have, by increasing the
supply of commodities available for consumption, affected 
prices. Changes in monetary systems and, in particular, 
changes in the gold supply have exerted a direct and imme- 
diate influence upon prices, by affecting the supply of money 
in circulation. Similar in character have been changes in 
banking and credit systems and changes in commercial 
practice that have affected the use of credit instruments 
and the rapidity of circulation of money and credits. All 
these forces influence prices, though their incidence is not
so specific as is that of the factors affecting individual commodities
directly.

Purpose of General Index Numbers of
Wholesale Prices

These separate forces cannot be isolated and evaluated. 
Their joint action causes a perplexing variety of price 
changes. In studying these changes the problem might be 
approached from several different points of view. It might 
be desired to study the readjustments that take place within 
the price system, to determine the nature and degree of the 
shifts within the system that come with changing conditions. 
Such a study would yield valuable information as to the 
behavior of prices and the character of their interrelations. 
Our immediate problem, however, is the determination of 
the net resultant of all these forces. Do all price movements 
cancel each other so that while some prices move up and 
some down there is no net change? Or is there at a given 
time a preponderance of movements in one direction, causing 
the level of general prices to move upward or downward? 
If there is such a trend, what is it, and how may it be meas- 
ured? Are the statistical methods that have been explained 
in the earlier sections applicable to the solution of this 
problem? 

The first step in this study involves the answering of the 
last question asked. It has been brought out that methods 
of summarizing quantitative data have been developed, 
but that these methods are applicable only when certain 
conditions are fulfilled. An average, it was noted, has no 
significance unless it represents a distinct central tendency
in a mass of homogeneous data. Moreover, the type of 
average to be employed depends upon the character of the 
distribution it is to represent. Until the distribution of 
the original data is studied no average or other statistical 
measure can be intelligently employed. We must first, 
then, determine what the raw materials of the problem are, 
and study the frequency distributions secured when these 
raw materials are organized.

For the present a quite general purpose will be assumed,
the determination of the change in the level of general 
wholesale prices between two specific dates. This is equiva- 
lent, of course, to measuring the change in the purchasing 
power of money in wholesale markets. The raw materials 
of the problem consist of a number of price quotations on
individual commodities, quotations being secured for the 
two dates to be compared. Each pair of quotations meas- 
ures the change in the price of a single commodity, a change 
caused by the interplay of many forces. When a great 
many such price quotations are brought together we have 
a mass of data representing the interaction of a multitude 
of forces, some individual and specific in their incidence, 
some general, affecting the prices of large groups of com- 
modities or of all commodities. What we seek to determine 
is the net resultant of all these factors. We seek a measure 
of the composite effect of the numerous forces that are 
causing individual prices to rise or fall. This measure will 
constitute an index number of wholesale prices. 

The unit with which we must deal is a single price varia- 
tion. Whether the statistical methods with which we are 
familiar may be employed in the organization and analysis 
of a number of such units depends upon the behavior of
such units in mass. The following examples illustrate the
frequency distributions secured when these data are clas- 
sified. 


Frequency Distributions of Price Ratios

Each price variation is, of course, a ratio, the ratio of the 
price of a commodity at a given date to the price of the 
commodity at another date. The ratios may be reduced 
to a comparable basis by putting them all in the form of 
relatives, of the type illustrated in the earlier examples of 
index numbers. Thus, using one of the pairs of price quo- 
tations given above, the ratio of the price of steel scrap 
in November, 1937, to the price in October, 1937, is
$12.500 : $14.688, which, in the form of a relative, becomes
85.1 : 100. In constructing the following frequency table, the
prices at wholesale in 1927 of 670 commodities were expressed
as relatives, with the 1926 price as a base in each
case. The distribution of these 670 relative numbers is
shown in Table 43.


Table 43

Distribution of the Relative Prices of 670 Commodities in 1927 *
(Average prices in 1926 = 100)

Relative prices    Mid-point    No. of cases    Percentage of total
                       m             f          number of cases

 52.5- 57.4            55             1                 .1
 57.5- 62.4            60             2                 .3
 62.5- 67.4            65             6                 .9
 67.5- 72.4            70             7                1.0
 72.5- 77.4            75             8                1.2
 77.5- 82.4            80            25                3.7
 82.5- 87.4            85            50                7.5
 87.5- 92.4            90            76               11.3
 92.5- 97.4            95           136               20.3
 97.5-102.4           100           196               29.3
102.5-107.4           105            83               12.4
107.5-112.4           110            26                3.9
112.5-117.4           115            16                2.4
117.5-122.4           120            14                2.1
122.5-127.4           125            12                1.8
127.5-132.4           130             2                 .3
132.5-137.4           135             3                 .5
137.5-142.4           140             5                 .8
142.5-147.4           145             1                 .1
147.5-152.4           150             -                  -
152.5-157.4           155             1                 .1

                                    670              100.0


The frequency polygon representing this distribution ap- 
pears in Fig. 49. For purposes of comparison with similar 
distributions the figure shows the percentage distribution. 
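
The conversion of class frequencies to a percentage distribution may be sketched in Python as follows. The mid-points and frequencies are those of Table 43, with the frequency of the class centred on 140 taken as 5; that reading is an assumption, adopted because it makes the frequencies add to the stated total of 670. Names are illustrative. The percentages printed in the table appear to have been adjusted to total exactly 100, so a straight computation may differ from them by a tenth of a point in one or two classes.

```python
midpoints_1927 = [55 + 5 * i for i in range(21)]  # 55, 60, ..., 155
cases_1927 = [1, 2, 6, 7, 8, 25, 50, 76, 136, 196, 83,
              26, 16, 14, 12, 2, 3, 5, 1, 0, 1]


def percentage_distribution(frequencies):
    """Express each class frequency as a percentage of the total."""
    total = sum(frequencies)
    return [f / total * 100 for f in frequencies]


percentages_1927 = percentage_distribution(cases_1927)
modal_position = cases_1927.index(max(cases_1927))
modal_midpoint = midpoints_1927[modal_position]
modal_share = round(percentages_1927[modal_position], 1)

print(sum(cases_1927), modal_midpoint, modal_share)
```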

* The 670 commodities included were those employed by the U. S. Bureau
of Labor Statistics in the construction of its index of wholesale prices. The
original figures, and the relatives, appear in Bulletin 473, of that Bureau, on
"Wholesale Prices, 1913-1927."

The correspondence of this frequency distribution to the
standard types portrayed in earlier sections is obvious.
There is the same marked concentration about a central
tendency, in this case a tendency of prices to remain stable,
for 29 per cent of all the cases showed a change not exceeding
2.5 per cent from their prices in the base year. There
is also, in this case, a fairly symmetrical distribution about
this central tendency, though the range above the mode is
slightly greater than the range below. Without at present
considering the question as to which average might best be
used to represent the central tendency in this distribution,
it is apparent that the use of some average is quite legitimate.

Fig. 49. Frequency Polygon: Distribution of Relative Prices of
670 Commodities in 1927 (Average prices in 1926 = 100).
Horizontal axis: Relative Price.

The example just given has been based upon price variations
from one year to the next, over a period during which
the level of general prices declined slightly (4.6 per cent).
W. C. Mitchell gives a much more comprehensive illustration,
based upon the distribution of 5,578 price variations
from one year to the next over the period 1890-1913, which
shows the same general grouping. The excess of the range
above the mode over the range below is somewhat more
pronounced, in connection with which fact it should be
noted that prices were rising during most of the 23 years
covered. The distribution secured by Mitchell is shown in
Fig. 42.

The inertia of prices is most conspicuous when year-to-year
price changes are studied. It is therefore advisable to
consider the character of price variations over a longer
period, that we may learn whether the same type of distribution
is secured. Two examples are given, one of price
changes over a seven-year period, marked by a considerable
decline in prices, the other of price changes over a five-year
period characterized by rapidly rising prices. The table
following shows the distribution of 774 price variations,
prices in 1933 being expressed as relatives on a 1926 base.
The general level of wholesale prices, it should be noted,
declined some 33 per cent from 1926 to 1933.

Fig. 50. Frequency Polygon: Distribution of Relative Prices of
774 Commodities in 1933 (Average prices in 1926 = 100)

The data in Table 44 are plotted in the form of a frequency 
polygon in Fig. 50, the percentage distribution being shown.
It will be noted that the distribution is curtailed, the five
upper classes being omitted. 

Table 44

Distribution of Relative Prices of 774 Commodities in 1933
(Average prices in 1926 = 100)

Relative prices    Mid-point    No. of cases    Percentage of total
                       m             f          number of cases

 10- 14.9             12.5            3                 .4
 15- 19.9             17.5            -                  -
 20- 24.9             22.5            1                 .1
 25- 29.9             27.5            7                 .9
 30- 34.9             32.5           13                1.7
 35- 39.9             37.5           24                3.1
 40- 44.9             42.5           28                3.6
 45- 49.9             47.5           51                6.6
 50- 54.9             52.5           49                6.3
 55- 59.9             57.5           50                6.5
 60- 64.9             62.5           62                8.0
 65- 69.9             67.5           58                7.5
 70- 74.9             72.5           93               12.0
 75- 79.9             77.5           81               10.5
 80- 84.9             82.5           62                8.0
 85- 89.9             87.5           67                8.7
 90- 94.9             92.5           40                5.2
 95- 99.9             97.5           27                3.5
100-104.9            102.5           27                3.5
105-109.9            107.5           11                1.4
110-114.9            112.5            6                 .8
115-119.9            117.5            8                1.0
120-124.9            122.5            1                 .1
125-129.9            127.5            2                 .3
155-159.9            157.5            1                 .1
180-184.9            182.5            1                 .1
190-194.9            192.5            1                 .1

                                    774              100.0

The distributions depicted in Figs. 49 and 50 differ materially.
The range of the variations is greater in the second
case, a condition naturally to be expected because of the
longer period covered. Secondly, a very much smaller percentage
of cases is concentrated in the modal group, though
there is still a pronounced central tendency. Both distributions,
as plotted on the arithmetic scale, are fairly symmetrical,
though a few extreme cases extend the actual upper limit
of the second distribution. In Fig. 49 the concentration about
the central tendency is much more marked, and the deviations
of individual price ratios from the central tendency
are smaller. This distribution resembles one which would
be secured from highly accurate physical measurements, or
the distribution of shots from a very accurate piece of artillery.
The second curve corresponds to one representing less
accurate physical measurements, or to the distribution of
shots from an old or inaccurate field piece. The modal
value occurs less frequently and the deviations from the
central tendency are greater. It has been established that
the longer the period covered in price comparisons such as
those made above, the more pronounced is the tendency
shown in the second curve. The value of the maximum
ordinate falls and the range of the distribution increases.
The curve becomes flatter and more extended as the time
interval increases. And, quite obviously, as this process
goes on the representative character of any type of average
declines. Unless there is concentration about a central
tendency an average is merely an abstraction, without concrete
significance.
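
The contrast between the two distributions can be put in figures with a rough Python sketch that computes, from the grouped data of Tables 43 and 44, the mean, the standard deviation, and the share of cases in the modal class. The text does not make this computation at this point; it is added here only as an illustration. Every case is treated as lying at its class mid-point, the frequency of the 137.5-142.4 class in Table 43 is taken as 5 (an assumption, so that the total is 670), and the names are hypothetical.

```python
from math import sqrt


def grouped_summary(midpoints, frequencies):
    """Mean, standard deviation, and modal-class share (per cent) of grouped data."""
    n = sum(frequencies)
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies)) / n
    modal_share = max(frequencies) / n * 100
    return mean, sqrt(variance), modal_share


# Table 43: relatives for 1927 on the 1926 base, mid-points 55 to 155.
mid_1927 = [55 + 5 * i for i in range(21)]
f_1927 = [1, 2, 6, 7, 8, 25, 50, 76, 136, 196, 83,
          26, 16, 14, 12, 2, 3, 5, 1, 0, 1]

# Table 44: relatives for 1933 on the 1926 base, mid-points 12.5 to 127.5,
# followed by the three isolated upper classes.
mid_1933 = [12.5 + 5 * i for i in range(24)] + [157.5, 182.5, 192.5]
f_1933 = [3, 0, 1, 7, 13, 24, 28, 51, 49, 50, 62, 58, 93, 81, 62, 67,
          40, 27, 27, 11, 6, 8, 1, 2, 1, 1, 1]

mean_1927, sd_1927, share_1927 = grouped_summary(mid_1927, f_1927)
mean_1933, sd_1933, share_1933 = grouped_summary(mid_1933, f_1933)

print(f"1927: mean {mean_1927:.1f}, s.d. {sd_1927:.1f}, modal class {share_1927:.1f}%")
print(f"1933: mean {mean_1933:.1f}, s.d. {sd_1933:.1f}, modal class {share_1933:.1f}%")
```

The longer comparison shows the larger standard deviation and the smaller modal share, which is the flattening and spreading of the curve described above.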

It is possible at this point to state as a tentative conclu- 
sion that price variations are capable of statistical measure- 
ment, that they may be represented appropriately by an 
average value, provided the period covered is not too long. 
No definite statement can be made as to the maximum 
period over which price variations may be measured. Index 
numbers having accurate and significant values must be 
based upon comparisons over relatively short periods, the 
most accurate being year-to-year comparisons. Index numbers
designed merely to show general trends in prices may
cover longer periods, though the makers and users of such
index numbers should realize their limitations.¹

As a final example we may note the distribution of the
relative prices of 1,437 commodities in 1918, average prices
during the period July, 1913 to June, 1914 serving as base.²
This was a period marked by rapidly rising prices. In consulting
the graph (Fig. 51) it should be noted that the scales
are not the same as those employed in the two figures preceding.

¹ Cf. W. C. Mitchell, "The Making and Using of Index Numbers," Bulletin
284 (Wholesale Price Series), U. S. Bureau of Labor Statistics.

² Data compiled by the Price Section of the War Industries Board; reproduced
in Part I, Bulletin 284, U. S. Bureau of Labor Statistics, 70.
A study of this distribution bears out the conclusion 
reached from the two examples preceding. There is a central 
tendency sufficiently pronounced to be well represented by 
an average. In this case, moreover, the modal group is 



Fig. 51. — Frequency Polygon: Distribution of Relative Prices of 1,437 
Commodities in 1918 (Average prices July 1913 to June 1914 = 100) 


that with a mid-point of 180, so that the tendency toward 
concentration cannot be attributed to inertia, but to the 
presence of external forces affecting the price system as a 
whole. There is, however, one marked point of difference 
between this distribution and the two others. The tendency 
toward skewness, which was in evidence in the first example, 
is pronounced in this case. The curve, as plotted on the 
arithmetic scale, is markedly asymmetrical. The greatest 
concentration is near the lower limit of the scale and a long 
tail, extending in fact far beyond the limit of the chart,
tapers out to the right. The highest relative price, indeed,
is 3,009, representing an increase of 2,909 points. The small- 
est relative price, in comparison, is 36, representing a decline 
of 64 points on the scale. 

A price increase, expressed as a relative, has no upper
limit. An increase of 100, 500, 1,000 per cent or more is
conceivable and possible. The greatest price increase noted
by the War Industries Board in its study of prices during 
the war was one of 4,981 per cent, in the case of acetipheneti- 
din. But 100 per cent is the maximum decline possible, as 
that would mean that the price of a commodity had fallen 
to zero. This is the explanation of the skewness noted in 
the curves shown. When any considerable number of price 
ratios are tabulated the corresponding frequency curve, 
plotted on an arithmetic scale, shows this characteristic 
feature, a feature which is most conspicuous during a period 
of rising prices. 

The argument developed in the preceding pages may be 
briefly summarized. Before discussing the practice of index 
number construction it was considered advisable to study 
the character of the raw materials and the nature of the 
distributions secured when these materials are brought to- 
gether, in order to determine whether ordinary statistical 
methods are appropriate. The raw materials, we have seen, 
consist of individual price variations, expressed as ratios. 
When a number of these ratios are assembled a frequency 
distribution is secured which somewhat resembles the dis- 
tribution of data following the normal law of error. A 
central tendency, which may legitimately be represented 
by an average, is apparent in the distribution of price varia- 
tions. The central tendency is less marked, however, and 
the deviations from it are more pronounced, the longer the 
period covered in the price comparison, so that an average
becomes less representative as this period increases. In
addition, a tendency toward skewness has been noted, and
this was seen to be quite pronounced in a period of rising
prices. This skewness is due to the fact that we are dealing
with ratios that have a definite lower limit and no upper
limit.


Variety of Methods Employed in Index Number
Construction

Many methods have been and are being employed in the 
construction of index numbers of wholesale prices. Usage 
varies for many reasons. There are differences of opinion 
as to which is theoretically the best method. There are 
practical difficulties to be surmounted, difficulties which
inevitably cause differences in practice because of the vary- 
ing resources of the agencies engaged in these tasks. And 
there are, finally, differences due to the varying purposes 
for which index numbers are constructed, the varying ques- 
tions they are designed to answer. 

Prevailing differences in practice and differences in the 
results secured by the employment of various methods in 
the construction of index numbers can perhaps be illus- 
trated most effectively by the application of a number of 
methods to the same data. Table 45, on the preceding page, 
presents the raw material to which these various methods are 
to be applied — the average farm prices, on December 1, of 
twelve leading crops, from 1919 to 1935. 


EXPLANATION OF SYMBOLS 

The symbols to be employed in the computation of different
types of index numbers have the following meanings:

$p_0'$ : price of a given commodity at time “0” (the base period).
$q_0'$ : quantity of same commodity at time “0”.
$p_1'$ : price of same commodity at time “1”.
$q_1'$ : quantity of same commodity at time “1”.
$p_0''$ : price of a second commodity at time “0”.
$q_0''$ : quantity of second commodity at time “0”.
$p_1''$ : price of second commodity at time “1”.
$q_1''$ : quantity of second commodity at time “1”.
$\frac{p_1'}{p_0'}$ : a price relative (relation of price of a given commodity at
time “1” to price of same commodity at time “0”).
$\frac{q_1'}{q_0'}$ : a quantity relative.
$P_0$ : price level at time “0”.
$P_1$ : price level at time “1”.

Simple Index Numbers of Prices

In his exhaustive analysis of methods of index number
construction ¹ Irving Fisher distinguishes six fundamental
types: the aggregative (or price aggregate), the arithmetic, 
harmonic, geometric, median, and mode. The latter has 
never been employed in a practical way, and may be omitted. 
The characteristics of the five remaining types may be 
brought out by considering each of them in its simplest form, 
before examining the more complicated combinations. 

AGGREGATES OF ACTUAL PRICES

In the construction of index numbers of the simple ag- 
gregative type, commodity prices pertaining to a given 
date are added; general price changes are measured by 
comparing the results thus secured for different dates. Using 
the above symbols 

$$\frac{P_1}{P_0} = \frac{\sum p_1}{\sum p_0}.$$

When such index numbers are constructed from the data of 
Table 45 the results in Table 46 on page 182 are secured. 
The actual aggregates are given in column (2); to facilitate
comparison the same figures are reduced to relatives, with
the 1919 aggregate as base, in column (3).

The results secured by this method of constructing index
numbers of prices will be compared shortly with results
secured from the same data by other methods. The chief
weakness of this type of index number is obvious. This is
not an unweighted nor yet an equally weighted index.
The influence of each commodity upon the result is depend- 
ent upon the price of the unit in which it happens to be 

¹ The Making of Index Numbers, Houghton Mifflin Co., 1922.

Table 46

Index Numbers of Farm Crop Prices

(Aggregates of actual prices)

  (1)            (2)                   (3)
  Year     Index (aggregate      Index, relative
           of actual prices)      (1919 = 100)

  1919         $36.349                100
  1920          26.790                 74
  1921          18.690                 51
  1922          19.913                 55
  1923          21.838                 60
  1924          23.142                 64
  1925          23.831                 66
  1926          22.499                 62
  1927          19.291                 53
  1928          19.584                 54
  1929          21.339                 59
  1930          18.290                 50
  1931          13.211                 36
  1932           9.503                 26
  1933          13.691                 38
  1934          20.723                 57
  1935          12.844                 35

traded. In the present index, hay, which is quoted by the 
ton, is given more weight than all the other 11 commodities 
combined, with flaxseed second in importance. The index 
secured by adding the quotations is weighted in an entirely 
illogical fashion and cannot be accepted as reflecting the 
course of farm crop prices. 
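The arithmetic behind this objection can be checked with a minimal sketch. It assumes only the 1919 and 1920 prices quoted in Table 47; the variable names are illustrative and are not the author's.

```python
# December 1 farm prices per quoted unit, copied from Table 47.
prices_1919 = {"corn": 1.343, "cotton": 0.356, "hay": 20.150, "wheat": 2.131,
               "oats": 0.702, "potatoes": 1.580, "sugar": 0.102, "barley": 1.215,
               "tobacco": 0.390, "flaxseed": 4.383, "rye": 1.331, "rice": 2.666}
prices_1920 = {"corn": 0.656, "cotton": 0.139, "hay": 17.780, "wheat": 1.433,
               "oats": 0.456, "potatoes": 1.128, "sugar": 0.053, "barley": 0.716,
               "tobacco": 0.212, "flaxseed": 1.770, "rye": 1.256, "rice": 1.191}

# Simple aggregative index: add the quotations for each date, then compare.
aggregate_1919 = sum(prices_1919.values())
aggregate_1920 = sum(prices_1920.values())
index_1920 = 100 * aggregate_1920 / aggregate_1919

# Share of the 1919 aggregate contributed by hay alone (quoted by the ton).
hay_share_1919 = prices_1919["hay"] / aggregate_1919

print(f"aggregates: {aggregate_1919:.3f} and {aggregate_1920:.3f}")
print(f"index for 1920 (1919 = 100): {index_1920:.1f}")
print(f"share of hay in the 1919 aggregate: {hay_share_1919:.1%}")
```

The share of hay comes out above one half, which is the sense in which hay outweighs the other eleven commodities combined.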

One method which has been employed for avoiding the 
unequal weighting caused by the difference in units in 
which different commodities are traded is to reduce all 
quotations to the same unit. Thus hay, rice, corn, cotton,
and the other commodities might all be quoted by the
pound, and these quotations added to secure the index.
Yet this method, which has been employed in the
construction of Bradstreet's index, merely replaces one system
of illogical weighting by an equally illogical one. Equal
weight, if such is desired, is not given to all commodities
by this method. Thus, in 1919 hay was worth $.010075
per pound, cotton $.356 per pound and rice $.059 per
pound, cotton having a weight in an aggregate of per pound
prices 6 times that of rice and 35 times that of hay.

ARITHMETIC AVERAGES OF RELATIVE PRICES

Another method employed in the construction of index 
numbers involves the reduction of each quoted price to a 
relative, with reference to the price of the same commodity 
at a certain basic date, these relative figures then being 
averaged by any of the conventional methods. The example 
in Table 47 illustrates the first phase of this process, data 
for two years being utilized. The year 1919 is taken as base. 

Table 47

Computation of Relative Prices for the Construction of Index Numbers

  (1)             (2)          (3)         (4)         (5)         (6)
  Commodity       Unit        Price,     Relative     Price,     Relative
                               1919                    1920

  Corn            Bu.        $ 1.343       100       $  .656       48.8
  Cotton          Lb.           .356       100          .139       39.0
  Hay             Ton (sh.)   20.150       100        17.780       88.2
  Wheat           Bu.          2.131       100         1.433       67.2
  Oats            Bu.           .702       100          .456       65.0
  Wh. Potatoes    Bu.          1.580       100         1.128       71.4
  Sugar           Lb.           .102       100          .053       52.0
  Barley          Bu.          1.215       100          .716       58.9
  Tobacco         Lb.           .390       100          .212       54.4
  Flaxseed        Bu.          4.383       100         1.770       40.4
  Rye             Bu.          1.331       100         1.256       94.4
  Rice            Bu.          2.666       100         1.191       44.7

  Total                                  1,200                    724.4


From these figures the arithmetic averages of relative
prices in these two years may be readily computed. The
formula for any single relative is $\frac{p_1}{p_0}$. When there are N
relatives the formula for the index number at time “1” is

$$\frac{\sum \frac{p_1}{p_0}}{N}.$$

In the present case

$$\text{Index (1919)} = \frac{1{,}200}{12} = 100.$$

$$\text{Index (1920)} = \frac{724.4}{12} = 60.4.$$

Index numbers computed in this way for the years 1919 to
1935, inclusive, are shown in column (3) of Table 50.
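As a check on the computation just described, the following sketch averages the 1920 relatives of Table 47, column (6). The list name is illustrative only.

```python
# Relative prices for 1920 (1919 = 100), from Table 47, column (6),
# in the order: corn, cotton, hay, wheat, oats, potatoes, sugar,
# barley, tobacco, flaxseed, rye, rice.
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

total = sum(relatives_1920)               # sum of the N relatives
index_1920 = total / len(relatives_1920)  # simple arithmetic average

print(f"sum of relatives: {total:.1f}")
print(f"index for 1920:   {index_1920:.1f}")
```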

This type of index number is usually termed an “un- 
weighted” index of relative prices. It is weighted, however, 
just as are the types illustrated in the two examples pre- 
ceding. The quantity employed as weight in each case is 
the amount of each commodity which would sell for $100
in the base year. In the preceding example the following 
quantities have been employed as weights : 


  Corn         74.5 bu.         Sugar       980.4 lbs.
  Cotton      280.9 lbs.        Barley       82.3 bu.
  Hay          4.96 tons        Tobacco     256.4 lbs.
  Wheat        46.9 bu.         Flaxseed     22.8 bu.
  Oats        142.5 bu.         Rye          75.1 bu.
  Potatoes     63.3 bu.         Rice         37.5 bu.


What has been done, in effect, in the computation of the
simple average of relative prices has been to determine the
aggregate amount for which the above quantities would sell
in each of the seventeen years included. At 1919 prices each
of the above quantities would sell for $100, the aggregate
value being $1,200; at 1920 prices the aggregate value of
the above quantities was $724.40. These aggregates,
divided by 12, give the index numbers shown in column (3),
Table 50: 100 for 1919, 60 (60.4) for 1920, etc. Thus the
“unweighted average of relative prices” is in fact a weighted
aggregate of actual prices. It is equally weighted in the
sense that the value of the quantity of each commodity
employed as weight was equal to $100 in the base year, 1919.¹
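The equivalence stated above may be verified with a short sketch. It assumes the Table 47 prices; `implicit_quantity` is an illustrative name for the amount of each commodity that $100 would buy at 1919 prices. Because unrounded relatives are used, the 1920 aggregate agrees with $724.40 only to within a cent or so.

```python
# Prices per quoted unit, from Table 47.
p1919 = {"corn": 1.343, "cotton": 0.356, "hay": 20.150, "wheat": 2.131,
         "oats": 0.702, "potatoes": 1.580, "sugar": 0.102, "barley": 1.215,
         "tobacco": 0.390, "flaxseed": 4.383, "rye": 1.331, "rice": 2.666}
p1920 = {"corn": 0.656, "cotton": 0.139, "hay": 17.780, "wheat": 1.433,
         "oats": 0.456, "potatoes": 1.128, "sugar": 0.053, "barley": 0.716,
         "tobacco": 0.212, "flaxseed": 1.770, "rye": 1.256, "rice": 1.191}

# Quantity of each commodity that would sell for $100 in the base year.
implicit_quantity = {c: 100 / p for c, p in p1919.items()}

# Value of that fixed basket at the prices of each year.
value_1919 = sum(implicit_quantity[c] * p1919[c] for c in p1919)
value_1920 = sum(implicit_quantity[c] * p1920[c] for c in p1919)

# Dividing the aggregate by the number of commodities gives the
# "unweighted" arithmetic average of relative prices.
index_1920 = value_1920 / len(p1919)

print(f"basket value at 1919 prices: {value_1919:.2f}")
print(f"basket value at 1920 prices: {value_1920:.2f}")
print(f"index for 1920: {index_1920:.1f}")
```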

MEDIANS OF RELATIVE PRICES

The median rather than the arithmetic mean may be
employed in securing the average of the relative prices for
each year. When the relatives in column (6) of Table 47
are arranged in order of magnitude the following distribution
is secured:


  39.0        58.9
  40.4        65.0
  44.7        67.2
  48.8        71.4
  52.0        88.2
  54.4        94.4


The smallest relative price is 39.0, the greatest 94.4; 
the median value is 56.65. This median value is the index
number for 1920. All the index numbers computed in this 
way from the medians of relative prices are presented in 
column (4), Table 50. 
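A brief sketch of the same computation, using the median function of the Python standard library (a modern convenience, not part of the original procedure): with twelve items the median is the mean of the sixth and seventh values in order of magnitude.

```python
import statistics

# Relative prices for 1920 (1919 = 100), from Table 47, column (6).
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

ordered = sorted(relatives_1920)
# With an even number of items the median lies midway between the
# two central values (here the sixth and seventh).
middle_pair = ordered[5], ordered[6]
median_index = statistics.median(relatives_1920)

print("central pair:", middle_pair)
print(f"median index for 1920: {median_index:.2f}")
```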


GEOMETRIC AVERAGES OF RELATIVE PRICES


The geometric averages of the relative prices for the
various years may now be computed and the results
compared with those secured in the preceding examples. A
single relative being represented by the symbol $\frac{p_1}{p_0}$, the
formula for the geometric mean of N relatives is

$$M_g = \sqrt[N]{\frac{p_1'}{p_0'} \times \frac{p_1''}{p_0''} \times \frac{p_1'''}{p_0'''} \times \cdots}$$

A geometric mean is generally computed by the aid of
logarithms; in this case

$$\log M_g = \frac{\sum \log \frac{p_1}{p_0}}{N}.$$

¹ Attention was called to this characteristic of the simple average of relative
prices by F. R. Macaulay, American Economic Review, Dec., 1915, 928.

The method of computation may be illustrated for the
years 1919 and 1920. The relative prices of the various
commodities are repeated from Table 47.

Table 48

Computation of Geometric Averages of Relative Prices

  (1)             (2)              (3)               (4)              (5)
  Commodity     Relative price,  Logarithm of      Relative price,  Logarithm of
                1919             fig. in col. (2)  1920             fig. in col. (4)

  Corn            100              2.0               48.8             1.68842
  Cotton          100              2.0               39.0             1.59106
  Hay             100              2.0               88.2             1.94547
  Wheat           100              2.0               67.2             1.82737
  Oats            100              2.0               65.0             1.81291
  Wh. Potatoes    100              2.0               71.4             1.85370
  Sugar           100              2.0               52.0             1.71600
  Barley          100              2.0               58.9             1.77012
  Tobacco         100              2.0               54.4             1.73560
  Flaxseed        100              2.0               40.4             1.60638
  Rye             100              2.0               94.4             1.97497
  Rice            100              2.0               44.7             1.65031

  Total                           24.0                               21.17231

$$\log M_g\ (1919) = \frac{24.0}{12} = 2$$

$$M_g = \text{anti-logarithm of } 2 = 100$$

$$\log M_g\ (1920) = \frac{21.17231}{12} = 1.76436$$

$$M_g = \text{anti-logarithm of } 1.76436 = 58.1.$$

This value, 58.1, is the index number for 1920. The 
results for all the years are summarized in column (5), 
Table 50. 
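The logarithmic procedure of Table 48 can be reproduced in a few lines. This is a sketch under the assumption that common (base 10) logarithms are used, as the tabulated values indicate; the cross-check against `statistics.geometric_mean` is a modern convenience.

```python
import math
import statistics

# Relative prices for 1920 (1919 = 100), from Table 48, column (4).
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Average the common logarithms, then take the anti-logarithm.
log_sum = sum(math.log10(r) for r in relatives_1920)
mean_log = log_sum / len(relatives_1920)
geometric_index = 10 ** mean_log

# The standard library computes the same quantity directly.
direct = statistics.geometric_mean(relatives_1920)

print(f"sum of logarithms:  {log_sum:.5f}")
print(f"mean logarithm:     {mean_log:.5f}")
print(f"geometric index:    {geometric_index:.1f}")
```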

HARMONIC AVERAGES OF RELATIVE PRICES

The characteristics of the harmonic average have been 
discussed in a preceding chapter. The reciprocal of the 
harmonic mean, it will be recalled, is the arithmetic mean
of the reciprocals of the constituent measures. The
constituent items, in the present case, are price relatives of the
form $\frac{p_1}{p_0}$. The reciprocal of such a relative is $\frac{p_0}{p_1}$. The
formula for the harmonic mean of N price relatives is,
therefore,

$$H = \frac{N}{\sum \frac{p_0}{p_1}}.$$

The method of computation is illustrated in Table 49. 


Table 49

Computation of Harmonic Averages of Relative Prices

  (1)             (2)              (3)               (4)              (5)
  Commodity     Relative price,  Reciprocal of     Relative price,  Reciprocal of
                1919             fig. in col. (2)  1920             fig. in col. (4)

  Corn            100              .01               48.8             .02049180
  Cotton          100              .01               39.0             .02564103
  Hay             100              .01               88.2             .01133787
  Wheat           100              .01               67.2             .01488095
  Oats            100              .01               65.0             .01538462
  Wh. Potatoes    100              .01               71.4             .01400560
  Sugar           100              .01               52.0             .01923077
  Barley          100              .01               58.9             .01697793
  Tobacco         100              .01               54.4             .01838235
  Flaxseed        100              .01               40.4             .02475248
  Rye             100              .01               94.4             .01059322
  Rice            100              .01               44.7             .02237136

  Total                            .12                                .21404998

$$H\ (1919) = \frac{12}{.12} = 100$$

$$H\ (1920) = \frac{12}{.21404998} = 56.1.$$


The index numbers computed in this way for all the years
included in the study are shown in column (6), Table 50.
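The reciprocal method of Table 49 may be sketched as follows; the direct call to `statistics.harmonic_mean` is included only as a cross-check and is not part of the original procedure.

```python
import statistics

# Relative prices for 1920 (1919 = 100), from Table 49, column (4).
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# The harmonic mean is N divided by the sum of the reciprocals.
reciprocal_sum = sum(1 / r for r in relatives_1920)
harmonic_index = len(relatives_1920) / reciprocal_sum

# Cross-check with the standard library.
direct = statistics.harmonic_mean(relatives_1920)

print(f"sum of reciprocals: {reciprocal_sum:.8f}")
print(f"harmonic index:     {harmonic_index:.1f}")
```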

In the construction of the five types of index numbers
explained above no attempt has been made to use a logical 
weighting system. All are termed “unweighted” averages, 
a term which is quite misleading. The first index con- 
structed, based on aggregates of actual prices, is a heavily 
weighted index number, though the weights are illogical. 
In the next four the quantities employed as weights are the 
amounts purchasable for $100 in 1919. The five results 
are brought together and compared in Table 50. In each case 
the index is given to the nearest whole number. These index 
numbers are plotted in Fig. 52. 

Comparison of Simple Index Numbers

The four averages of relative prices agree much more 
closely with each other than with the index numbers based 
on aggregates. For reasons already suggested the latter is 
quite untrustworthy as a measure of price changes. Of the 
other index numbers, the arithmetic, geometric, and har- 
monic means show a consistent relationship, a fact which 
follows from the nature of the averages employed. Except 
in the base year the geometric mean is always less than 
the arithmetic and the harmonic is always less than the 
geometric, the amount of difference increasing as the dis- 
persion of prices becomes greater. The median, with only 
twelve items to be averaged, is somewhat unstable, and its 
relationship to the other averages is not always a consistent 
one. 

How are we to choose among these varying results? No 
one of these “unweighted” index numbers is perfect, for 
weights which have crept in do not measure the relative 
importance of the various commodities included in the 
index numbers. But, neglecting for the moment the question 
of weights, is it possible to test the adequacy of the different 
methods of measuring changes in the prices as given?

Table 50

Index Numbers of Farm Crop Prices, 1919-1935

(1919 = 100)

  (1)       (2)           (3)           (4)         (5)           (6)
  Year   Aggregates    Arithmetic    Medians     Geometric     Harmonic
         of actual     averages of   of          averages of   averages of
         prices (as    relative      relative    relative      relative
         relatives)    prices        prices      prices        prices

  1919      100           100          100          100           100
  1920       74            60           57           58            56
  1921       51            44           42           43            42
  1922       55            51           50           50            49
  1923       60            55           50           54            53
  1924       64            60           61           59            58
  1925       66            59           53           57            55
  1926       62            53           49           52            50
  1927       53            53           55           52            52
  1928       54            48           48           47            46
  1929       59            54           53           53            52
  1930       50            38           32           36            35
  1931       36            27           27           27            26
  1932       26            20           18           19            19
  1933       38            35           33           34            34
  1934       57            48           48           46            43
  1935       35            35           36           35            34

Fig. 52. Comparison of Five Simple Index Numbers of Farm Crop
Prices, 1919-1935 (1919 = 100)

THE TIME REVERSAL TEST

For this purpose Irving Fisher has employed what he 
terms the “time reversal test.” This is merely a test to 
determine whether a given method will work both ways in 
time, forward and backward. If from 1935 to 1936 sugar 
should increase from four to eight cents a pound, the price 
in 1936 would be 200 per cent of the price in 1935, and the
price in 1935 would be 50 per cent of the price in 1936. 
One figure is the reciprocal of the other; their product 
(2.00 X .50) is unity. Similarly, if a given method of index 
number construction shows the general price level in one 
year to be 200 per cent of the level in the preceding year, it 
should work correctly when reversed; it should show that 
the price level in the first year was 50 per cent of the price 
level in the second year. When the data for any two years 
are treated by the same method, but with the bases reversed, 
the two index numbers secured should be reciprocals of each 
other. Their product should always be unity. If it is not, 
there is an inherent bias in the method.

This test may be applied to the methods employed above, 
using prices for 1919 and 1920. With 1919 as base the 
following results were obtained; 


  Year   Aggregates    Arithmetic    Medians of   Geometric     Harmonic
         of actual     averages of   relative     averages of   averages of
         prices (as    relative      prices       relative      relative
         relatives)    prices                     prices        prices

  1919    100           100           100          100           100
  1920     73.70216      60.4          56.65        58.1          56.0617

and with 1920 as base:

  Year   Aggregates    Arithmetic    Medians of   Geometric     Harmonic
         of actual     averages of   relative     averages of   averages of
         prices (as    relative      prices       relative      relative
         relatives)    prices                     prices        prices

  1919   [illegible]    178.36666     176.85       172.04        165.6467
  1920    100           100           100          100           100

When the index numbers for 1920 in the first table are
multiplied by the corresponding index numbers for 1919
in the second table, we have the following values. (In
securing these products the index numbers are put in the
ratio, not in the percentage form.)

         Aggregates    Arithmetic    Medians of   Geometric     Harmonic
         of actual     averages of   relative     averages of   averages of
         prices        relative      prices       relative      relative
                       prices                     prices        prices

          1.00          1.0767        1.00         1.00          .9286


This time reversal test is met by three of the methods 
employed. It is not met by either the arithmetic or har- 
monic averages. The former has a distinct upward bias, 
amounting to more than seven per cent when the errors for 
1919 and 1920 are compounded, while the harmonic mean 
shows almost as large an error in the opposite direction. 
Unless the inherent bias which is found in both these aver- 
ages is rectified in some way, methods based upon these 
averages should not be used in the construction of index
numbers.
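The time reversal test as described above can be sketched for all five simple methods at once. The sketch assumes the 1919 and 1920 prices of Table 47 and computes relatives from the unrounded prices, so the products may differ from the tabulated ones in the fourth decimal place. The function and dictionary names are illustrative only.

```python
import statistics

# Prices from Table 47, in the same commodity order for both years.
p1919 = [1.343, 0.356, 20.150, 2.131, 0.702, 1.580,
         0.102, 1.215, 0.390, 4.383, 1.331, 2.666]
p1920 = [0.656, 0.139, 17.780, 1.433, 0.456, 1.128,
         0.053, 0.716, 0.212, 1.770, 1.256, 1.191]

def relatives(given, base):
    """Price relatives (as ratios) of the given year on the base year."""
    return [g / b for g, b in zip(given, base)]

# Each method maps (given-year prices, base-year prices) to an index ratio.
METHODS = {
    "aggregate":  lambda given, base: sum(given) / sum(base),
    "arithmetic": lambda given, base: statistics.fmean(relatives(given, base)),
    "median":     lambda given, base: statistics.median(relatives(given, base)),
    "geometric":  lambda given, base: statistics.geometric_mean(relatives(given, base)),
    "harmonic":   lambda given, base: statistics.harmonic_mean(relatives(given, base)),
}

# Forward index (1920 on 1919) times backward index (1919 on 1920).
# A method meets the test only if this product is unity.
products = {name: method(p1920, p1919) * method(p1919, p1920)
            for name, method in METHODS.items()}

for name, product in products.items():
    print(f"{name:10s} {product:.4f}")
```

The aggregate and the geometric mean return a product of unity by construction; the median does so only approximately when the number of items is even, since the two central relatives are averaged.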


The Weighting of Index Numbers

Five simple index numbers of prices have been described
in the preceding section. With the introduction of weighting
the number of possible combinations is greatly increased,
but only a few of these types need concern us here.

In the construction of an accurate measure of price changes
logical weights must be employed, weights which truly reflect
the relative importance of the commodities included. If the
weighting problem is ignored haphazard and illogical weights
will inevitably be present, whether recognized or not.

The data used in the preceding examples may be utilized
to illustrate methods of weighting and to show the effects
of varying weights upon the values of index numbers. The
weights employed in constructing index numbers of farm
crop prices may be either the quantities or values of the
crops produced, depending upon the type of index selected.
The quantities produced during the period 1919-1935 are
given in Table 51.

Table 51. Annual Physical Production, Twelve Crops, 1919-1935

WEIGHTED AGGREGATES OF ACTUAL PRICES 

The thoroughly illogical results obtained when actual 
prices, as quoted, are totaled to secure an index number 
have been pointed out. The same objection cannot be 
made when the prices are appropriately weighted before 
the aggregate is taken. If for weights we employ the quan- 
tities produced in the base year (at time “0”) the formula 
for the weighted aggregate is 

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}.$$

This is, in effect, the method employed by the United States 
Bureau of Labor Statistics, though the quantities are taken 
from a year other than the base year. The formula for this 
type of weighted aggregative index is known as Laspeyres’ 
formula. The method is illustrated in Table 52. 

The desired index numbers, in the form of relatives, may 
be computed from the aggregates secured by totaling col- 
umns (6) and (8) of Table 52. Either year may be taken 
as the base, and the price aggregate in the other year ex- 
pressed as a relative on this base. With the 1919 aggregate 
as base the index for 1920 is 58.2. Index numbers similarly 
computed for the other years are given in column (2), 
Table 55. 

Another type of weighted aggregate may be constructed,
with weights taken not from the base period but from the
later period in the given comparison. That is, we may
employ $q_1$ (quantity at time “1”) as weight in comparing
prices at time “1” with prices at time “0,” and employ
$q_2$ (quantity at time “2”) as weight in comparing prices at
time “2” with prices at time “0.” Algebraically, the
formula for the index number at time “1” is

$$\frac{\sum p_1 q_1}{\sum p_0 q_1}.$$

This is known as Paasche’s formula. The process of compu- 
tation is precisely the same as in the preceding example, 
except that the weights are changed with each successive 
year. The index numbers secured by this method are given 
in column (3), Table 55. 
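The two weighted aggregative formulas may be compared in a minimal sketch. Since the production figures of Table 51 are not reproduced here, the prices and quantities below are hypothetical two-commodity values chosen only for illustration; the function names are likewise illustrative.

```python
def laspeyres(p0, p1, q0):
    """Aggregate of given-year prices over base-year prices,
    both weighted by base-year quantities."""
    return (sum(p * q for p, q in zip(p1, q0))
            / sum(p * q for p, q in zip(p0, q0)))

def paasche(p0, p1, q1):
    """Aggregate of given-year prices over base-year prices,
    both weighted by given-year quantities."""
    return (sum(p * q for p, q in zip(p1, q1))
            / sum(p * q for p, q in zip(p0, q1)))

# Hypothetical data for two commodities (not taken from the text).
p0 = [1.0, 2.0]   # prices at time "0"
p1 = [2.0, 3.0]   # prices at time "1"
q0 = [10, 5]      # quantities at time "0"
q1 = [6, 8]       # quantities at time "1"

print(f"Laspeyres index: {100 * laspeyres(p0, p1, q0):.1f}")
print(f"Paasche index:   {100 * paasche(p0, p1, q1):.1f}")
```

With these figures the base-year weighting gives 35/20 and the given-year weighting gives 36/22, so the two formulas disagree whenever the quantities shift between the two dates.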

The weights in these two cases have been quantities, for
prices, multiplied by quantities, give aggregates in dollar
values. But in weighting individual price relatives,
quantities will not serve. The abstract relatives must be weighted
by values, if the resulting products are to be comparable.
For values are in terms of a common dollar unit, while
quantities may be expressed in a variety of units. The values
which are to be employed as weights may be derived in
various ways.

Fisher ¹ outlines the four following methods, of which the
second and third are hybrid types:

I. Each weight = base year price × base year quantity ($p_0 q_0$).

II. Each weight = base year price × given year quantity ($p_0 q_1$).

III. Each weight = given year price × base year quantity ($p_1 q_0$).

IV. Each weight = given year price × given year quantity ($p_1 q_1$).

Just as certain averages possess inherent bias, so a
distinctive weight bias arises from each type of value weighting.
(This inherent bias is absent from the quantity weighting.)
A downward bias arises from weighting systems I and II
(in which base year prices are used), while an upward bias
arises from weighting systems III and IV (using prices in
the given year). This is in part capable of mathematical
demonstration ² and has in part been established by
numerous trials.

¹ Irving Fisher, The Making of Index Numbers, 54.

² An index weighted by type III must exceed an index weighted by type I.

(Footnote 2 continued on page 196.)

In the several examples next following we shall deal only
with values of quantities produced in the base year, 1919. 
These values are given in the third column of Table 53. For 
weighting purposes they are taken to the nearest million. 

WEIGHTED ARITHMETIC AVERAGES OF
RELATIVE PRICES

In the computation of an index of this type, each relative 
is multiplied by the appropriate weight and the sum of the 
products is divided by the sum of the weights. The process 
is illustrated in Table 53. 

The index for 1920, it will be noted, is identical with that
secured from the computations illustrated in Table 52. That
index is a weighted aggregate of actual prices, the weights
being the quantities produced in the base year. An
arithmetic mean of relative prices, weighted by values in the base
year, is always equal to a relative constructed from such an
aggregate.¹
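The weighted arithmetic mean may be sketched as follows, assuming the 1920 relatives of Table 47 and the base-year value weights of Table 53 (values of the 1919 crops, in millions of dollars). The list names are illustrative.

```python
# Order: corn, cotton, hay, wheat, oats, potatoes, sugar,
# barley, tobacco, flaxseed, rye, rice.
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Value of each crop produced in 1919, millions of dollars (Table 53).
weights = [3598, 2031, 1543, 2029, 777, 470,
           446, 159, 563, 30, 105, 114]

# Multiply each relative by its weight, sum the products,
# and divide by the sum of the weights.
weighted_sum = sum(r * w for r, w in zip(relatives_1920, weights))
weight_total = sum(weights)
weighted_index = weighted_sum / weight_total

print(f"sum of weights:            {weight_total:,}")
print(f"sum of weighted relatives: {weighted_sum:,.1f}")
print(f"weighted index for 1920:   {weighted_index:.1f}")
```

Because each weight is a base-year value $p_0 q_0$, every product of relative and weight reduces to $p_1 q_0$, which is why this mean coincides with the aggregate weighted by base-year quantities.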

(Footnote 2 continued from page 196.)

Weighting the price relative of a given commodity by type III, we have

$$\frac{p_1}{p_0} \times p_1 q_0$$

while by type I we have

$$\frac{p_1}{p_0} \times p_0 q_0.$$

If $p_1$ exceeds $p_0$ (if the price relative is above 100) the weight by type III
($p_1 q_0$) is greater than the weight by type I ($p_0 q_0$). That is, all relatives above
100 are more heavily weighted by type III than by type I. But if $p_1$ is less than
$p_0$ the weight by type III ($p_1 q_0$) is less than the weight by type I ($p_0 q_0$). All
relatives below 100 are less heavily weighted by type III than by type I. Thus
the effect of all price increases is over-emphasized and the effect of all price
declines is under-emphasized by type III, giving a net result always greater
than type I. The same is true of type IV as compared with type II. As
between types I and IV there is no necessary relation, but in general an index
weighted by type IV will exceed an index weighted by type I. Base year
weighting involves a downward bias while given year weighting involves an
upward bias. (For a more detailed discussion of bias in weighting see Fisher,
The Making of Index Numbers, Chapter V and pages 384-87.)

Table 53

Computation of Weighted Arithmetic Averages of Relative Prices

  Commodity   Relative   Weight    Relative      Relative   Weight    Relative
              price,               price         price,               price
              1919                 × weight      1920                 × weight

  Corn          100      $3,598     $359,800       48.8     $3,598    $175,582.4
  Cotton        100       2,031      203,100       39.0      2,031      79,209.0
  Hay           100       1,543      154,300       88.2      1,543     136,092.6
  Wheat         100       2,029      202,900       67.2      2,029     136,348.8
  Oats          100         777       77,700       65.0        777      50,505.0
  Potatoes      100         470       47,000       71.4        470      33,558.0
  Sugar         100         446       44,600       52.0        446      23,192.0
  Barley        100         159       15,900       58.9        159       9,365.1
  Tobacco       100         563       56,300       54.4        563      30,627.2
  Flaxseed      100          30        3,000       40.4         30       1,212.0
  Rye           100         105       10,500       94.4        105       9,912.0
  Rice          100         114       11,400       44.7        114       5,095.8

  Total                 $11,865   $1,186,500                $11,865   $690,699.9

(The weights employed are the values of the quantities produced in
1919, in millions.)

$$\text{Weighted arithmetic mean (1919)} = \frac{\$1{,}186{,}500}{\$11{,}865} = 100$$

$$\text{Weighted arithmetic mean (1920)} = \frac{\$690{,}699.9}{\$11{,}865} = 58.2$$

¹ This may be readily demonstrated algebraically. The value of any
commodity in the base year is $p_0 q_0$, while the price relative for a second year is
$\frac{p_1}{p_0}$. The weighted mean of such price relatives is equal to

$$\frac{\frac{p_1'}{p_0'} \times p_0' q_0' + \frac{p_1''}{p_0''} \times p_0'' q_0'' + \frac{p_1'''}{p_0'''} \times p_0''' q_0''' + \cdots}{p_0' q_0' + p_0'' q_0'' + p_0''' q_0''' + \cdots}$$

which reduces to

$$\frac{\sum p_1 q_0}{\sum p_0 q_0},$$

a weighted aggregate of the type mentioned.

In the same way the harmonic mean, weighted by full values in the second
year, reduces to

$$\frac{\sum p_1 q_1}{\sum p_0 q_1}.$$

This has already been encountered as an aggregate of actual prices weighted
by quantities in the second year.

WEIGHTED GEOMETRIC AVERAGES OF
RELATIVE PRICES

The process of computing the weighted geometric mean 
is identical with that of computing the unweighted geometric 
mean, except that the logarithm of each relative is multi- 
plied by the given weight and the sum of these weighted 
logarithms is divided by the sum of the weights, the result 
being the logarithm of the desired index.¹ The method is
illustrated in Table 54.


Table 54

Computation of Weighted Geometric Average of Relative Prices, 1920

(1919 = 100)

  Commodity       Relative      Logarithm of     Weight    Logarithm of
                  price, 1920   relative price             relative price
                                                           × weight

  Corn              48.8          1.68842         3,598      6074.93516
  Cotton            39.0          1.59106         2,031      3231.44286
  Hay               88.2          1.94547         1,543      3001.86021
  Wheat             67.2          1.82737         2,029      3707.73373
  Oats              65.0          1.81291           777      1408.63107
  Potatoes, Wh.     71.4          1.85370           470       871.23900
  Sugar             52.0          1.71600           446       765.33600
  Barley            58.9          1.77012           159       281.44908
  Tobacco           54.4          1.73560           563       977.14280
  Flaxseed          40.4          1.60638            30        48.19140
  Rye               94.4          1.97497           105       207.37185
  Rice              44.7          1.65031           114       188.13534

  Total                                          11,865    20,763.46850

$$\log M_g = \frac{20{,}763.46850}{11{,}865} = 1.74998,$$

$$M_g = 56.2$$


The index for 1920 on the 1919 base is 56.2. Measure- 
ments secured for all the years of the period covered are 
given in column (5), Table 55, together with the other 
weighted index numbers already explained. 
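The weighted geometric mean of Table 54 may be reproduced with a short sketch. It assumes common logarithms and the base-year value weights of Table 53; the variable names are illustrative.

```python
import math

# Order: corn, cotton, hay, wheat, oats, potatoes, sugar,
# barley, tobacco, flaxseed, rye, rice.
relatives_1920 = [48.8, 39.0, 88.2, 67.2, 65.0, 71.4,
                  52.0, 58.9, 54.4, 40.4, 94.4, 44.7]

# Value of each crop produced in 1919, millions of dollars (Table 54).
weights = [3598, 2031, 1543, 2029, 777, 470,
           446, 159, 563, 30, 105, 114]

# Weight each logarithm, sum, and divide by the sum of the weights;
# the anti-logarithm of the result is the index.
weighted_log_sum = sum(w * math.log10(r)
                       for r, w in zip(relatives_1920, weights))
mean_log = weighted_log_sum / sum(weights)
weighted_geometric_index = 10 ** mean_log

print(f"sum of weighted logarithms: {weighted_log_sum:,.5f}")
print(f"weighted mean logarithm:    {mean_log:.5f}")
print(f"weighted geometric index:   {weighted_geometric_index:.1f}")
```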

¹ The formula for the weighted geometric mean is given in Chapter IV.

How are we to judge of the relative merits of these three
index numbers? We may, first, apply the time reversal
test which was employed in comparing the five simple index
numbers. This test is not met by any of the weighted types
we have constructed. The geometric is equally at fault
with the others. Though the simple geometric meets the
test, the introduction of weighting imparts a bias to the
result. Judged by that test alone none of the three is
satisfactory. We may next try the second fundamental test
that Fisher has developed, which is termed the “factor
reversal test.”


THE FACTOR REVERSAL TEST 


The total value of a given commodity in a given year is,
of course, the product of the quantity produced and the
price per unit; algebraically, it is equal to $p'q'$. The ratio
of the total value in one year to the total value in the
preceding year is $\frac{p_1' q_1'}{p_0' q_0'}$. If, from one year to the next, both price
and quantity should double, the price relative would be 200,
the quantity relative 200, and the value relative 400. The
total value in the second year would be four times the value
in the first year. The value relative would be equal to the
product of the price and quantity relatives, a relationship
which is obvious in the case of a single commodity.

If, for a number of commodities, we construct an index
of the price change from one year to the next and an index 
of the quantity change from one year to the next, we should 
expect their product to be equal to the ratio of the total 
values in the second year to the total values in the first 
year. If the product is not equal to the value ratio, there 
is an error in one or both of the index numbers. 
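
The test just described can be put in the form of a short Python sketch. The prices and quantities below are hypothetical, invented only to show the mechanics, and the function names are ours. The sketch assumes an aggregative price index weighted by base year quantities, with the quantity index obtained by interchanging the p's and the q's.

```python
def aggregate(prices, quantities):
    """Sum of price x quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def base_weighted_price_index(p0, p1, q0):
    return aggregate(p1, q0) / aggregate(p0, q0)


def base_weighted_quantity_index(q0, q1, p0):
    # Same formula with the roles of p and q interchanged.
    return aggregate(p0, q1) / aggregate(p0, q0)


def value_ratio(p0, p1, q0, q1):
    return aggregate(p1, q1) / aggregate(p0, q0)


def factor_reversal_gap(p0, p1, q0, q1):
    """Product of price and quantity indices minus the value ratio."""
    product = base_weighted_price_index(p0, p1, q0) * base_weighted_quantity_index(q0, q1, p0)
    return product - value_ratio(p0, p1, q0, q1)


# Hypothetical data: one commodity whose price and quantity both double.
single = factor_reversal_gap([2.0], [4.0], [10.0], [20.0])

# Hypothetical data: two commodities moving in different directions.
several = factor_reversal_gap([2.0, 5.0], [4.0, 4.0], [10.0, 8.0], [6.0, 12.0])

print(single, several)
```

For the single commodity the gap is zero, as the argument above requires. For the two hypothetical commodities the product of the indices exceeds the value ratio, which is the kind of failure the test is designed to expose.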

As an illustration, we may apply this test to the first
aggregative index constructed,

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}.$$

An index of quantities may be computed from this same formula,
merely interchanging the q’s and the p’s; the formula becomes

$$\frac{\sum q_1 p_0}{\sum q_0 p_0}.$$

The same price factor appears in numerator and denominator,
as we desire to measure only the effect of the quantity
change. Substituting the given values of the twelve
farm crops we have

$$\text{Quantity index, 1920 (1919 = 100)} = \frac{\sum q_1 p_0}{\sum q_0 p_0} = 1.0956.$$

In percentage form the index of quantities produced in 
1920 is 109.56, with 1919 as base. The corresponding price 
index, by the same formula, is 58.24. The product 
1.0956 X .5824 = .6381. 

That is, if prices have decreased 41.76 per cent, while 
quantities have increased 9.56 per cent, the total value 
should show a decrease of 36.19 per cent. 

For the value ratio we have

$$\frac{\sum p_1 q_1}{\sum p_0 q_0} = \frac{\$7{,}441{,}317{,}450}{\$11{,}864{,}461{,}250} = .6272.$$

There is a discrepancy here of about one per cent. The 
actual error is not great, but the formula definitely fails to 
meet the factor reversal test, and cannot be accepted as 
satisfactory. 

When this test is applied to the second aggregative index 
we secure the following values for 1920, with respect to 
1919 as base : 

$$\text{Price index} = \frac{\sum p_1 q_1}{\sum p_0 q_1} = 57.25$$

$$\text{Quantity index} = \frac{\sum q_1 p_1}{\sum q_0 p_1} = 107.69$$

Product = .5725 × 1.0769 = .6166
(In securing the product the index numbers are put in 
the ratio, not in the percentage form.) 

Here is an error of the same magnitude in the other direction.

The weighted geometric average also fails to meet this
fundamental factor reversal test. With respect to both the
geometric index and the aggregates we have, apparently,
by the introduction of weights spoiled index numbers which
in their simple form were unbiased. Yet weights we must
have, if the index numbers are to represent the facts ac- 
curately. Neither a simple index nor a weighted form of a 
simple index will meet the two tests laid down as funda- 
mental. Professor Fisher tested 46 such formulas, of which 
only four (the simple geometric, median, mode, and ag- 
gregative) met the time reversal test, and none met the 
factor reversal test. 
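
The failure of the two weighted aggregatives may be checked from the figures already quoted for 1920 on the 1919 base. The following Python sketch uses only those published figures; the dictionary and function names are our own labels.

```python
# Figures quoted in the text for 1920 on the 1919 base (ratio form).
value_ratio = 7_441_317_450 / 11_864_461_250

base_weighted = {"price": 0.5824, "quantity": 1.0956}   # first aggregative index
given_weighted = {"price": 0.5725, "quantity": 1.0769}  # second aggregative index


def factor_reversal_error(index_pair, true_ratio):
    """Product of price and quantity indices minus the true value ratio."""
    return index_pair["price"] * index_pair["quantity"] - true_ratio


error_first = factor_reversal_error(base_weighted, value_ratio)
error_second = factor_reversal_error(given_weighted, value_ratio)

print(round(value_ratio, 4))
print(round(error_first, 3), round(error_second, 3))
```

The two errors are of nearly equal size and of opposite sign, each about one point in the second decimal place of the ratio.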


THE “IDEAL” INDEX


A way out of this difficulty is offered by the possibility 
of “rectifying” formulas in a crossing process, by averaging 
geometrically formulas which err in opposite directions. 
Professor Fisher has made exhaustive trials of all possible 
formulas by this process, finding thirteen formulas in all 
which met both tests. Of these he has selected one as 
“ideal,” from the viewpoint of both accuracy and simplicity 
of calculation. This ideal index is the geometric mean of the 
two aggregative types illustrated above. Its formula¹ is

$$\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}.$$


This index may be computed readily, in the present 
instance, from the results already obtained. Thus for 1920 
we have 


$$\text{Ideal index} = \sqrt{.5824 \times .5725} = .5774.$$


In the customary percentage form this is 57.74. 
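
As a minimal sketch, the ideal index may be written as a Python function of the two sets of prices and quantities. The function names are ours; the check at the end uses the two aggregative indices for 1920 quoted in the text, since the underlying crop data are not reproduced in this section.

```python
import math


def aggregate(prices, quantities):
    """Sum of price x quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def ideal_index(p0, p1, q0, q1):
    """Geometric mean of the base-weighted and given-year-weighted aggregatives."""
    base_weighted = aggregate(p1, q0) / aggregate(p0, q0)
    given_weighted = aggregate(p1, q1) / aggregate(p0, q1)
    return math.sqrt(base_weighted * given_weighted)


# Check against the two aggregative indices already computed in the text.
ideal_1920 = math.sqrt(0.5824 * 0.5725)
print(round(ideal_1920, 4))
```

The printed figure agrees with the .5774 obtained above. With raw data the same result would be reached by calling `ideal_index` directly on the prices and quantities of the two years.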

This index number meets both the time reversal and the 
factor reversal test. Applying the former: 

Index of prices, 1920 (1919 = 100) = 57.74
Index of prices, 1919 (1920 = 100) = 173.18

.5774 × 1.7318 = 1.00.

¹ The same formula was developed independently by Bowley, Pigou, Walsh,
and Young. See The Making of Index Numbers, xv, 240-242.

For the factor reversal test, applied to the data for 1920 
(with 1919 as base), we have 

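$$\text{Index of prices} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} = 57.74.$$

$$\text{Index of quantities} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = 108.62.$$

$$\text{Value ratio} = \frac{\sum p_1 q_1}{\sum p_0 q_0} = .6272.$$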

Product of price and quantity indices = .5774 X 1.0862 = .6272. 
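
That these agreements are not accidents of the particular data may be seen from a short algebraic sketch. The symbols $P_{01}$, $P_{10}$ and $Q_{01}$ are our own labels for the ideal price index of year 1 on year 0, the same index with the years interchanged, and the ideal quantity index; they do not appear in the formulas above.

```latex
% Time reversal: every sum in the forward index meets its reciprocal in the backward index.
P_{01}\,P_{10}
  = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0}\cdot\frac{\sum p_1 q_1}{\sum p_0 q_1}}
    \;\sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1}\cdot\frac{\sum p_0 q_0}{\sum p_1 q_0}}
  = 1 .

% Factor reversal: the cross terms \sum p_1 q_0 and \sum p_0 q_1 cancel.
P_{01}\,Q_{01}
  = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0}\cdot\frac{\sum p_1 q_1}{\sum p_0 q_1}
          \cdot\frac{\sum q_1 p_0}{\sum q_0 p_0}\cdot\frac{\sum q_1 p_1}{\sum q_0 p_1}}
  = \sqrt{\left(\frac{\sum p_1 q_1}{\sum p_0 q_0}\right)^{2}}
  = \frac{\sum p_1 q_1}{\sum p_0 q_0} .
```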

The ideal index, the two weighted aggregates that enter
into its construction and the geometric mean weighted by
values in the base year are given in Table 55 for the years
1919 to 1935. The index numbers are plotted in Fig. 53.

The wide discrepancies that were found between the various
simple index numbers do not appear when the weighted
indices are compared. There are significant differences, but
there is none of the erratic behavior of some of the simpler
forms.

Fig. 53. Comparison of Four Weighted Index Numbers of Farm Crop
Prices, 1919-1935 (1919 = 100)

Table 55

Comparison of Weighted Index Numbers of Farm Crop Prices,
1919-1935

(1)     (2)              (3)              (4)                (5)
Year    Aggregative      Aggregative      Ideal index        Weighted
        (weighted by     (weighted by     (geometric mean    geometric average
        base year        given year       of indices in      (weighted by
        quantities)      quantities)      cols. (2) and (3)) base year
        Σp₁q₀ / Σp₀q₀    Σp₁q₁ / Σp₀q₁                       quantities)

1919      100.0            100.0            100.0              100.0
1920       58.2             57.2             57.7               56.2
1921       42.8             42.0             42.4               41.5
1922       53.6             53.1             53.4               52.9
1923       59.8             59.7             59.8               58.1
1924       65.0             64.3             64.6               64.4
1925       57.9             56.3             57.1               56.5
1926       51.4             49.2             50.3               49.6
1927       54.6             54.3             54.4               54.8
1928       51.8             51.1             51.4               51.2
1929       54.1             53.3             53.7               53.4
1930       41.3             39.6             40.4               39.4
1931       26.6             26.5             26.0               25.3
1932       19.0             18.9             19.0               18.1
1933       32.6             32.2             32.4               32.2
1934       51.0             52.1             51.8               49.5
1935       38.1             37.6             37.9               37.8

Of these four types the ideal index probably serves
as the best measure of the average price change between
1919 and each of the given years.¹ It is designed, it should
be remembered, to measure the change between two stated
times, and not for intermediate comparison. The value of
the index for 1933, for instance, is determined by the
relation between prices and quantities in 1919 and in 1933.
There is double weighting and the weights vary from year
to year. If 1933 is to be compared with 1932 a new index
is needed, in which the prices and quantities for 1933 and
1932 alone are included. Direct comparison on the basis
of the values for the ideal index given in the above table
is liable to error, because of the weighting system employed.

¹ The year 1919, which is here employed as base, is not a satisfactory
standard of reference for economic purposes. It was a disturbed year,
marking a transition from war-time to peace-time conditions.

It is one of the merits of the geometric mean with constant 
weights that it permits the index for each year to be com- 
pared directly not only with the base year index, but with 
the index for any other year. The base may be shifted 
directly from the relatives, and the same result will be 
secured as if the computation were made from the original 
data. If this same system be followed with the ideal index 
no large errors may be expected, but strict accuracy will 
not be secured.* 
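
The contrast drawn here may be illustrated with a Python sketch. All prices, quantities and weights below are hypothetical, and the function names are ours. The sketch shifts the base from 1919 to 1932 by simple division of the published relatives, and compares the outcome with an index computed directly from the 1932 and 1933 data.

```python
import math


def weighted_geometric_index(p_base, p_given, weights):
    """Weighted geometric mean of price relatives, in ratio form."""
    total = sum(weights)
    log_sum = sum(w * math.log(pg / pb) for pb, pg, w in zip(p_base, p_given, weights))
    return math.exp(log_sum / total)


def aggregate(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))


def ideal_index(p0, p1, q0, q1):
    return math.sqrt(
        (aggregate(p1, q0) / aggregate(p0, q0)) * (aggregate(p1, q1) / aggregate(p0, q1))
    )


# Hypothetical prices and quantities for three years and two commodities.
prices = {1919: [2.0, 5.0], 1932: [1.0, 2.0], 1933: [1.5, 2.5]}
quantities = {1919: [10.0, 8.0], 1932: [12.0, 6.0], 1933: [9.0, 9.0]}
weights = [3.0, 1.0]  # constant weights, assumed for illustration

# Geometric mean: shifting the base by division reproduces the direct figure.
g_32 = weighted_geometric_index(prices[1919], prices[1932], weights)
g_33 = weighted_geometric_index(prices[1919], prices[1933], weights)
g_direct = weighted_geometric_index(prices[1932], prices[1933], weights)
geometric_gap = g_33 / g_32 - g_direct

# Ideal index: the same shortcut only approximates the direct 1932-1933 index.
f_32 = ideal_index(prices[1919], prices[1932], quantities[1919], quantities[1932])
f_33 = ideal_index(prices[1919], prices[1933], quantities[1919], quantities[1933])
f_direct = ideal_index(prices[1932], prices[1933], quantities[1932], quantities[1933])
ideal_gap = f_33 / f_32 - f_direct

print(geometric_gap, ideal_gap)
```

With constant weights the gap for the geometric mean vanishes, apart from rounding in the machine arithmetic. For the ideal index a gap remains; its size in this sketch reflects the deliberately violent changes assumed in the hypothetical data and should not be read as typical.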


SOME ALTERNATIVE TYPES

The chief obstacles in the way of general adoption of the
ideal index arise from the difficulty of obtaining annual or
monthly quantities to use as weights, and from the time
involved in its computation. Where accuracy is essential
the latter is not a serious difficulty. As a substitute formula
which is much more quickly calculated Fisher has proposed

$$\frac{\sum (q_0 + q_1)\, p_1}{\sum (q_0 + q_1)\, p_0}.$$

This formula, which has also been recommended by Edgeworth
and Marshall, is considered by Fisher to be “the
best practical all-around formula, taking all four points
into account: accuracy, speed, minimum legitimate circular
discrepancy, simplicity.” Results from this formula
will generally differ from those secured from the ideal formula
by less than one fourth of one per cent. Table 56
illustrates the method of computation, data for
1919 and 1920 being employed.

* If year to year comparison be a primary aim in a given instance, the ideal
index may be constructed on the chain system. Link index numbers are first
constructed, each year serving as base for the computation of the index for the
succeeding year. These links may then be “chained” with reference to a fixed
base. Warren M. Persons has shown that the errors involved in following this
method are cumulative, and may be serious if the links are chained for a number
of years.

Table 56

Computation of Aggregative Index, Weighted by Combined Quantities

Index for 1920 on 1919 base, in percentage form: 57.7
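
The computation that Table 56 illustrates may be sketched in Python as follows. The crop data of the table are not reproduced in this section, so the prices and quantities used here are hypothetical; the function names are ours. The formula is the one now commonly called the Marshall-Edgeworth index.

```python
import math


def aggregate(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))


def combined_quantity_index(p0, p1, q0, q1):
    """Aggregative index weighted by the sum of base and given year quantities."""
    combined = [a + b for a, b in zip(q0, q1)]
    return aggregate(p1, combined) / aggregate(p0, combined)


def ideal_index(p0, p1, q0, q1):
    return math.sqrt(
        (aggregate(p1, q0) / aggregate(p0, q0)) * (aggregate(p1, q1) / aggregate(p0, q1))
    )


# Hypothetical prices and quantities for two commodities in two years.
p0, p1 = [2.0, 5.0], [1.0, 3.0]
q0, q1 = [10.0, 8.0], [12.0, 7.0]

combined = combined_quantity_index(p0, p1, q0, q1)
ideal = ideal_index(p0, p1, q0, q1)
relative_difference = combined / ideal - 1

print(round(combined, 4), round(ideal, 4))
```

For these assumed figures the two results differ by far less than one fourth of one per cent, in keeping with the general statement above, though a single hypothetical case proves nothing about the general rule.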

This formula requires the same data as the ideal index, 
and these are not generally to be had. Usually it is only 
possible to secure comprehensive quantity figures at each 
census period, and for the intervening years constant wei^ts 
must' be employed. In such cases the weighted aggregative 

Spigo 

2pogo 

is probably the most generally useful type. The weighted 
geometric has many virtues, but is subject to a definite 
weightihg bias. If no weights can be secured, or even ap- 
proximated, the simple geometric and the simple median 
are far better than any of the other simple types. The 
geometric mean is more generally useful than the median. 

An index number of prices is always based upon the study 
of a sample, the result being taken as representative of the 
entire field of prices from which the particular sample was 
drawn. Some method is needed, therefore, by which we may 
judge of the reliability of the different types of index num- 
bers, of their probable stability when computed from a 
number of successive samples. Some differences might be 
expected between index numbers based upon different samples.
With which type of index number would these differences
due to fluctuations of sampling be least?¹

Truman L. Kelley* has attempted to measure the probable
errors of the chief types of index numbers and has
graded these types on the basis of excellence in this respect.
Two index numbers, the weighted geometric mean and the
weighted median, are given the highest grade, as being the
most reliable, the least affected by fluctuations of sampling.
Fisher’s ideal index is ranked somewhat lower, though above
the weighted arithmetic and harmonic averages of price
relatives. The simple unweighted arithmetic average of
relatives is given the lowest rating in the list.

¹ The subject of sampling, in relation to the reliability of statistical measures,
is discussed in greater detail below.

* Truman L. Kelley, Statistical Method, New York, Macmillan, 1921,

For reliability, flexibility, and general excellence Kelley 
selects the weighted geometric mean as the best type of
price index number. A ratio of aggregates

$$\frac{\sum p_1 w}{\sum p_0 w}$$

with selected weights (not necessarily precisely equal to the 
quantities marketed or consumed) is given a total score, 
based on the essential requirements of a good index number, 
as high as that of the weighted geometric mean and higher 
than that of the ideal index. Weights other than actual 
quantities are used in order that there may be flexibility 
in the matter of weighting.

The detailed discussion of procedures in the preceding 
pages has clearly shown that there are some definitely faulty 
formulas, obviously unsuited for use in the construction of 
index numbers serving ordinary purposes. Among the better 
formulas there are some differences in respect of liability 
to bias and character of data needed, and some variations in 
sampling reliability. The maker of index numbers will have 
these in mind in choosing a formula to employ under given 
conditions. A more important factor in his choice, however, 
will be the purpose to be served by the index number, the 
question it is designed to answer. A weighted aggregate of 
actual prices answers one question definitively. It gives, 
without equivocation, the aggregate cost of a fixed bill of 
goods at one period, in relation to the cost of the same bill 
of goods at another. A geometric mean of relative prices 
answers another question. It measures with accuracy the 
average ratio of the prices of given commodities at one period 
to corresponding prices at another period. Some questions 
(for example, that answered by an unweighted arithmetic 
average of relative prices) have little if any economic
significance. It is because one or two main questions have
bulked large in economic discussion that emphasis has been
placed upon the finding of a “best” type of index number. Yet
the terms “best” and “ideal” are unfortunate, for they imply
that some absolute standard exists, with reference to which 
all formulas may be tested. No such absolute criterion may 
be applied to the diversity of research problems that call 
for the construction of index numbers. On the basis of his 
knowledge of the characteristics of different formulas, the 
discriminating investigator will choose technical methods
adapted to his data and appropriate to his purposes. 

Other Problems Involved in the Construction
of Price Index Numbers

The preceding section has dealt with the technical prob- 
lems connected with the averaging of a given set of data 
in order to secure an index number of price variations.
Certain methods have been shown to be quite faulty, while 
certain others have been found to be appropriate for given 
purposes. One who would use index numbers with intelligence
should understand fully the methods which have
been employed in securing given results, in order that he 
may know precisely what the given figure is designed to 
measure and what degree of reliability attaches to it. 

Such problems as these are not the only ones which 
confront those who construct index numbers, nor are these 
considerations the only ones which users of index numbers 
should bear in mind. Of equal importance with problems 
of averaging and weighting are the practical questions con- 
nected with the selection of representative samples. The 
only completely accurate measure of the general level of 
commodity prices would be secured by determining the ratio 
between all money units, including credit, in circulation
(account being taken of velocity of circulation) and all the
physical units of goods exchanged for money over a given
period. The measurement of general price changes between
two periods would thus involve complete knowledge of these
two factors for each of the two periods. Such knowledge, 
of course, cannot be had, so recourse must be had to the 
method of sampling. And primary importance attaches to 
the number of commodities and the character of the com- 
modities upon the prices of which a given index number is 
based. 


NUMBER OF COMMODITIES TO BE INCLUDED

Here again we are confronted with a relation that has 
already been mentioned, the relation between methods and 
uses. Decision as to the number of commodities and the 
kinds of commodities to be included in a given case must 
rest upon the purpose for which the index is to be con- 
structed. Assuming that the index number is to serve as a 
measure of general changes in the price level, the ques- 
tion as to the number of commodities to be included may 
be easily answered — the larger the sample the more rep- 
resentative will be the results. The frequency polygon 
based upon a large sample will approach more closely to the 
ideal curve which would represent all price quotations than 
will that based upon a small sample. Thus, as a measure 
of general price changes, more confidence may be placed 
in the Bureau of Labor Statistics index, which is based 
upon 813 price quotations, than in Bradstreet’s, which was 
based upon 96 quotations, though the latter had particular 
virtues of its own.¹ Yet index numbers based upon a small
number of quotations may not be ruled out as without
value. Wesley C. Mitchell, whose researches have materially
increased our knowledge of the price system and of
the characteristics of index numbers, has compared in detail
index numbers based upon varying numbers of quotations.
Unexpected similarities are found. Those constructed from
a limited number of quotations reflect the broad movements
of prices in much the same way as do those based upon the
prices of several hundred commodities. In important details
there are differences, however, differences which may
involve doubt as to the movement of prices in a given year.
In such cases the index numbers based upon many quotations
must be accepted as more accurate measures of general
price movements, provided that the commodities included
be equally representative of the various elements of the price
system.

¹ Bradstreet’s index was discontinued at the end of the year 1937.
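
The bearing of the number of quotations upon the stability of an index may be illustrated by a small sampling experiment in Python. Everything about the "universe" of price relatives below is an assumption made for illustration (its size, its lognormal form and its dispersion); only the sample sizes of 96 and 813 are taken from the text. The simple geometric mean is used as the index.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical "universe" of price relatives (ratio form), assumed lognormal.
population = [rng.lognormvariate(0.0, 0.3) for _ in range(5000)]


def spread_of_sample_indices(sample_size, replications=200):
    """Standard deviation of simple geometric-mean indices across repeated samples."""
    indices = [
        statistics.geometric_mean(rng.sample(population, sample_size))
        for _ in range(replications)
    ]
    return statistics.stdev(indices)


spread_96 = spread_of_sample_indices(96)
spread_813 = spread_of_sample_indices(813)
print(spread_96 > spread_813)
```

Under these assumptions the indices built from 813 quotations should scatter much less from sample to sample than those built from 96. The experiment treats the samples as random draws, which actual lists of quoted commodities are not, so it speaks only to fluctuations of sampling and not to the representative character of the commodities chosen.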

For other purposes, however, index numbers based upon 
a limited number of quotations may be preferable. This 
is particularly true when a “sensitive” index is desired, one 
that will serve as a forecaster of general price movements 
rather than as a precise measure of changes in the general 
price level. Of this type is the Harvard sensitive price index 
based upon quotations on 13 basic commodities (raw materials).
The purposes of such an index are served by the
selection of a limited number of commodities the prices of
which are subject to extreme fluctuations, rather than by 
the inclusion of a great many commodities. Yet the uses 
to which an index of this type may be put are limited. 
The “sluggishness” of the many-commodities index number 
is a sluggishness which inheres in the price system, and 
which must be reflected in a faithful index of general 
prices. 

The question of the number of commodities to be included 
cannot be discussed apart from that of the character of 
these commodities. The representative character of an index 
number rests in part upon the number of price series in- 
cluded, but the nature of these series is of even greater 
importance. For there are highly significant differences in 
the behavior of the prices of different commodity groups. 
These groups of prices, their interrelations, their behavior,
their relation to the functioning of the economic system
and to the swings of prosperity and depression, are matters
of immediate and practical importance to economists and
business men.

PRICE GROUPS IN THE FIELD OF WHOLESALE PRICES

Since an index number of wholesale prices must rest upon 
sample quotations, the sample must be representative, must 
include commodities whose prices are typical of the various 
elements in the price system. The division into elements 
for this purpose must be based upon the character of the 
price changes peculiar to the different groups. Of the 
groups thus distinguished, the most obvious are those rep- 
resenting different industries. Textile prices and steel prices, 
leather prices and the prices of chemicals are subject to 
different influences. Trade depressions and revivals do not 
affect all industries at the same time or in the same way, 
so that an index of wholesale prices must include quotations 
from all important industrial groups. If preponderant in- 
fluence upon an index is exerted by the prices of certain 
types of commodities, the index, by that much, loses its 
representative character. Thus Bradstreet’s index, it has 
been established, gave greater weight to cotton fabrics, 
hides and leather, and cured meats than was justified by their 
actual importance in trade, a fact which did not detract 
from its utility for some purposes but which lessened its 
value as a representative index of wholesale prices. 

The extent of these differences between the price movements
of commodities in different industrial groups may be
appreciated by comparison of the index numbers of wholesale
prices of grains and metals and metal products during
the business recession that began in the summer of 1937.

In order that an index may be representative it is not alone 
sufficient that all industries be given an appropriate number 
of representatives in the sample. Raw materials and man- 
ufactured goods show characteristic differences in their fluc- 
tuations, and fitting representation must be given to each 
of these groups. Prices of the former are, in general, more 
sensitive to changes in business conditions, their movements 
preceding those of manufactured goods and showing more
violent fluctuations. There are several reasons for this.
Raw materials are traded in for purposes of manufacture 
and sale. When business improves after a period of depres- 
sion, increased demand on the part of consumers (or expected 
increase in demand) leads competing manufacturers to bid 
against each other for raw materials. It is in the raw ma- 
terial markets that the pressure of increased demand first 
centers, and this bidding generally causes prices to rise in 
these markets before the prices of other goods are affected. 
Similarly, at the first evidence of slackening trade manu- 
facturers’ demand for raw materials falls off. Business 
forces pure and simple play in the raw material markets 
with more freedom than in the markets for manufactured 
goods. Hence the tendency of prices in these markets to 
anticipate, in their movements, prices in other commodity 
markets. 

Additional reasons for the greater stability of prices of 
manufactured goods are found in the fact that these prices 
include a greater percentage of stable cost factors, and in 
the control over supply exercised by most manufacturers. 
Wages, interests, rents move more slowly and less violently 
than do commodity prices. The inclusion of these elements 
in commodity prices tends to render these prices more stable. 
Therefore, as commodities move forward from the raw stage 
to their final manufactured condition their prices include 
more and more of these stabilizing elements, and become 
less violent in their fluctuations.¹ Control over supply,
which manufacturers possess in much higher degree than 
primary producers, makes possible the enforcement of defi- 
nite price policies by fabricators. Under these conditions, 
stable prices and variable output are usually found. 

¹ Cf. Mitchell, “The Making and Using of Index Numbers,” Bulletin 284,
U. S. Bureau of Labor Statistics, 44-45, for examples.

Each of the groups last mentioned contains minor groups
of commodities with distinct price characteristics. Within
the raw material group there are marked differences between
agricultural products, animal products, forest products, and
mineral products. Agricultural products are affected by
weather and crop conditions as well as by business conditions
and, though subject to price fluctuations of some magnitude,
reflect prevailing business conditions less accurately than do
the prices of mineral products.¹ Animal and forest products
appear to stand between these two with respect to the
faithfulness with which they reflect business conditions in
their price movements. Thus, in selecting raw materials
for inclusion in a sample of price quotations from which a
representative index number is to be constructed, fair weight
must be given to these various classes.²

Manufactured goods, again, do not constitute a single 
homogeneous group with respect to their price movements. 
In so far as they are to be used for further production, or to 
undergo further manufacture, they resemble raw materials 
in relation to the bidding of competing manufacturers, and 
their prices, therefore, are characterized by relatively wide 
oscillations. In so far as the demand for them is for the
purpose of final consumption, purely business forces have less
weight, and their prices are more stable. Related to this 
argument is that which has already been presented, the 
increasing stability of prices as the stable elements of wages 
and overhead charges bulk larger in commodity costs. So, 
again, the sample price quotations from which an index of 
wholesale prices is to be constructed must include prices 
representative of producers’ and consumers’ goods, of goods 
in the intermediate as well as the final stages of manufacture.

¹ It should not be inferred from this that there is no relation between
agricultural production and the prices of agricultural products, and general
business conditions. The immediate price relation is frequently one of
contradictory movements, and cycles in agricultural production are not
synchronous with business cycles. But conditions in these two fields of
economic activity are mutually related in many ways.

² Cf. “The Making and Using of Index Numbers,” 47.

Other important divisions of the price system exist. The
behavior of the prices of capital equipment differs from that
of prices of goods intended for human consumption. The
prices of durable goods differ in their fluctuations from the
prices of perishable goods. Goods imported into a given
country and goods exported from that country are usually
subject to the play of different forces. A representative
index number of wholesale prices should be based upon price
quotations drawn from all commodity groups marked by
distinctive modes of behavior, with weight given to each in
proportion to the relative importance in trade of the
commodities in that category.

Price Comparisons over Time 

In the opening pages of this chapter the fact was noted 
that the degree of dispersion found in frequency distributions 
of price relatives depended upon the length of time covered 
in price comparisons. Hence, on statistical grounds, there is 
justification for the conclusion that the accuracy of well- 
constructed price indices is high for measurements extending 
over a short interval, and becomes progressively lower as 
the range of the time comparison increases. This conclusion 
is supported by other considerations. 

In Laspeyres’ formula,

$$I = \frac{\sum p_1 q_0}{\sum p_0 q_0},$$

the price factor alone varies, as between numerator and
denominator. The constant weighting factor, $q_0$, is assumed to
define quantities entering into trade in an unchanging system
of income distribution, living standards, consumption habits,
etc. This system, for which Sir George Knibbs has used
the term “regimen,” is taken to be common to the two
periods compared. If it is constant, and if the $q$’s which
define its quantitative attributes are unchanged, then we may
expect to measure with accuracy the one factor which does
change: commodity prices. The condition we have here
assumed is the orthodox one of ceteris paribus, the condition that
factors other than the one subject to study remain constant.

In fact, of course, the regimen does not remain fixed.
Changes in tastes and in consumption habits occur; changes
in types of goods used as capital equipment take place; 
incomes shift, and the flow of goods is altered by changes in 
the distribution of buying power among consuming groups; 
the very price changes that we seek to measure bring altera- 
tions in the demand for given types of goods, and in the quan- 
tities produced. Of no small moment in the total situation 
are the changes that occur in the quality of goods that con- 
tinue to pass by the same trade names. The automobile of 
1938 is the same commodity, by name, as the automobile 
of 1910, but to the average consumer the later model
represents quite a different bundle of utilities. Similarly, steel,
textiles, locomotives, even the staple articles of diet have 
undergone important quality changes. A comparison of 
price levels in 1910 and 1938 that depends for its accuracy 
on the assumption that all elements of economic life except 
prices have remained constant is suspect, indeed. 

Our difficulties are not removed if we take as the standard 
of reference the regimen of the second of the two periods 
compared. This is done in Paasche’s formula,

$$I = \frac{\sum p_1 q_1}{\sum p_0 q_1}.$$

The system of consumption standards and all that goes
with it may be of modern vintage in this case, but the
difference between the regimens of the two periods
compared is just as wide. We have not held constant non-price
factors, and our measurement of price changes loses in
accuracy, as a result.

The method exemplified by the Ideal formula, that of
employing weighting factors drawn from both periods,
represents one attempt at the solution of this problem, but it
is far from perfect. The use of quantities drawn from the
two regimens does not create a common regimen, the
indispensable condition of full accuracy in such comparisons.

The practical procedure in the face of this difficulty is to 
restrict our comparisons, if high accuracy is required, to
periods not widely separated in time. Consumption habits,
living standards and technical production methods will be 
not widely dissimilar in two such periods, and hence the 
number of identical commodities common to the two periods 
will be large. Under these conditions considerable
confidence may be placed in index numbers measuring average
price changes. Comparison of price levels over longer periods 
may be desired, and may be justified, but the margin of error 
in the measurements may be expected to increase as the time 
span extends. Formal precision in weighting and in the
selection of acceptable formulas will not provide an escape from
the unavoidable difficulties arising out of alterations in the
basic conditions of economic life. Real continuity of in- 
dices covering a stretch of years is possible only on the 
basis of a persisting common regimen. 

These considerations support the claims of an index of 
the chain type, which involves the measurement of price 
changes between successive periods not far apart in time. 
Bruce D. Mudgett has advocated this procedure. The com- 
parison of price levels in two periods, close together in time 
and with similar regimens, will be accurate, if such an index 
as the Ideal be employed. The elements of such a chain 
may then be linked together, in attempting to measure 
price changes between non-consecutive periods. If the
regimens of the non-consecutive periods differ materially, the
accuracy of the comparison will probably not be high. But 
it is reasonable to believe that better results will be secured 
by bridging the intervening years in the manner proposed 
than by constructing a single far-flung index based only 
upon the widely dissimilar regimens of two periods far 
removed in time. 
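
The chain procedure may be sketched in Python as follows. The three periods of prices and quantities are hypothetical, and the function names are ours. Each link is an ideal index between adjacent periods, and the links are multiplied together to reach the later period.

```python
import math


def aggregate(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))


def ideal_index(p0, p1, q0, q1):
    return math.sqrt(
        (aggregate(p1, q0) / aggregate(p0, q0)) * (aggregate(p1, q1) / aggregate(p0, q1))
    )


def chained_ideal_index(prices_by_period, quantities_by_period):
    """Link indices between successive periods, chained to the first period = 1."""
    chain = [1.0]
    for k in range(1, len(prices_by_period)):
        link = ideal_index(
            prices_by_period[k - 1], prices_by_period[k],
            quantities_by_period[k - 1], quantities_by_period[k],
        )
        chain.append(chain[-1] * link)
    return chain


# Hypothetical three-period data (two commodities).
prices = [[2.0, 5.0], [1.0, 2.0], [1.5, 2.5]]
quantities = [[10.0, 8.0], [12.0, 6.0], [9.0, 9.0]]

chain = chained_ideal_index(prices, quantities)
direct = ideal_index(prices[0], prices[2], quantities[0], quantities[2])
print([round(x, 4) for x in chain], round(direct, 4))
```

The chained figure for the last period differs from the direct comparison of the first and last periods. Which of the two is to be preferred is the question argued in the text; the sketch shows only that they are different measurements.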

The Wholesale Price Index of the United States
Bureau of Labor Statistics

The authoritative index of wholesale prices in the United 
States is that compiled by the United States Bureau of Labor 
Statistics. This index was first constructed in 1902, for the 
period beginning with 1890. It was continued until 1913 
as an unweighted average of relative prices, the base of 
each relative being the average price of the given commodity 
during the ten-year period 1890-1899. Various revisions of 
procedure have been made since 1913. As it stands at pres- 
ent the index for any given period (week, month, or year) 
is a weighted aggregate of actual prices, the aggregate being 
expressed, to facilitate comparison, as a relative with 1926 
as the base. 

The index now includes 813 price series. (A single com- 
modity may be represented by several quotations, the prices 
for different grades or in different markets being given. 
Thus for raw cotton there are three quotations: Middling, 
New Orleans; Middling Upland, New York; and Middling 
Upland, Galveston.) In the derivation of the aggregate for 
any date each price quotation is multiplied by a given weight, 
known as a “quantity weight” or a “multiplier.” This 
same weight is applied to the price quotation for the base 
period. The cross products thus obtained for the base period 
and the date in question are values of a stated quantity of 
goods; they differ only in respect of the price factor. The 
following tabulation illustrates the method as applied to 
cotton: 



                                  Average price,    Quantity         Average price,
                                  November, 1937    weight           November, 1937
Commodity                         (per pound)       (pounds)         × quantity weight
                                  p_1               q_n              p_1 q_n

Cotton, Middling, New Orleans     $.080             1,399,496,000    $111,959,680
Cotton, Middling Upland, N. Y.     .080                77,750,000       6,220,000
Cotton, Middling Upland,
  Galveston                        .077             6,297,729,000     484,925,133

When this process is carried out for the entire 813 price 
series included, the sum of the values in the last column 
gives the index number for the given period, in this case 
November, 1937. As published, this sum is expressed as a 
relative, the aggregate in 1926 representing 100.¹ The 
formula for the index measuring the level of wholesale prices 
at time “1,” with reference to the base level at time “0” is, 
thus, 

$$\frac{\sum p_1 q_n}{\sum p_0 q_n}$$

where qn represents the constant multipliers. The method 
of construction renders it possible to shift the base to any 
desired year or month, changing the given relatives to per- 
centages on the new base. 
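
The aggregative calculation and the shifting of the base may be sketched in Python as follows. The multipliers and November, 1937 prices are those of the cotton tabulation above; the base-period prices and the series of relatives used for the base shift are not given in the text and are hypothetical, as are the function names.

```python
def aggregate_index(base_prices, given_prices, multipliers):
    """Weighted aggregate of given-period prices as a relative of the base aggregate."""
    given_total = sum(p * q for p, q in zip(given_prices, multipliers))
    base_total = sum(p * q for p, q in zip(base_prices, multipliers))
    return 100.0 * given_total / base_total


def shift_base(relatives, new_base):
    """Re-express a series of relatives as percentages of the value for new_base."""
    base_value = relatives[new_base]
    return {period: 100.0 * value / base_value for period, value in relatives.items()}


# Quantity weights (pounds) and November, 1937 prices from the cotton tabulation.
multipliers = [1_399_496_000, 77_750_000, 6_297_729_000]
prices_nov_1937 = [0.080, 0.080, 0.077]

# Cross products (price x multiplier), rounded to whole dollars.
cross_products = [round(p * q) for p, q in zip(prices_nov_1937, multipliers)]

# HYPOTHETICAL base-period prices: the 1926 quotations are not stated in the text.
prices_1926_hypothetical = [0.170, 0.175, 0.165]
cotton_index = aggregate_index(prices_1926_hypothetical, prices_nov_1937, multipliers)

# HYPOTHETICAL relatives on the 1926 base, shifted to a 1936 base.
relatives = {1926: 100.0, 1936: 80.0, 1937: 86.0}
shifted = shift_base(relatives, 1936)
```

Because the same multipliers enter the numerator and the denominator, the two aggregates differ only in respect of the price factor, which is the property the text relies upon.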

This index number, therefore, is based upon the cost at 
wholesale of a bill of goods. The bill of goods remaining 
the same, the total cost changes as the prices of the various 
commodities change, and the index measures the effect of 
these changing individual prices upon the total cost. 

It is essential, of course, that the quantity used as mul- 
tiplier for each series of price quotations truly represent 
the relative importance of the commodity in question. The 
multipliers employed are approximations to the quantities 
actually marketed. Changes are made from time to time in 
these quantities, the revisions being applied, of course, to 
the base period aggregates as well as to the figures for later 
periods. In addition, when it is necessary to substitute one 
price series for a related one that has been discontinued or 
has lost significance, minor modifications are made in the 
multipliers so as to maintain comparability between the 
aggregates for periods preceding and periods following the 
date of substitution.* 

¹ This base is, at the date of writing, twelve years removed in time. Adop- 
tion of a 1935-1937 base period is now being considered. 

* The method of adjustment is explained in an article, “Revised Method of 
Calculation of the B. L. S. Wholesale Price Index,” by Jesse M. Cutts and 
Samuel J. Dennis, Journal of the American Statistical Association, December, 
1937, 663-674. 

The Bureau of Labor Statistics publishes index numbers 
of wholesale prices for 10 major and 45 minor commodity 
groups, as well as a general index for all commodities. The 
major groups include farm products, foods, hides and leather 
products, textile products, fuel and lighting, metals and metal 
products, building materials, chemicals and drugs, house fur- 
nishing goods, and miscellaneous commodities. The con- 
stituent elements of the index are also classified into raw 
materials, semi-manufactured articles and finished products, 
and measurements of price changes for these groups are 
computed. The National Bureau of Economic Research has 
constructed index numbers for various other categories of 
commodities, utilizing the quotations of the Bureau of Labor 
Statistics. These classes include raw and processed goods, 
durable and non-durable goods, producers’ goods and con- 
sumers’ goods, goods destined for use in capital equipment 
and goods destined for human consumption, foods and non- 
foods, and crops, animal products, minerals, and forest 
products.¹ The availability of index numbers for various 
significant classes of goods makes it possible to trace price 
changes with more precision, and to interpret them more 
accurately, than when dependence is placed upon a single 
all-embracing index. For the elements of the price system 
are marked by wide diversity in their behavior over both 
long and short periods of time. 

Other Price Index Numbers 

The measurement of price changes by the use of index 
numbers has not been confined to wholesale prices. Many 
variations of this device have been utilized in measuring 
price movements in other fields. It will be useful at this 
point briefly to indicate the character of some of these 
variations.² 

¹ See Prices in Recession and Recovery, N. Y., National Bureau of Economic 
Research, 1936, 492-540. 

² Detailed information concerning the character and content of a wide variety 
of index numbers, price and other, will be found in An Index to Business Indices, 
Donald H. Davenport and Frances V. Scott, Chicago, Business Publications, 
Inc., 1937. 

INDEX NUMBERS OF RETAIL PRICES 

An index of retail food prices is published currently by 
the United States Bureau of Labor Statistics. The general 
methods employed are similar to those already explained in 
connection with the index of wholesale prices computed by 
that agency, with such differences as inevitably result from 
the nature of the material. 

Actual retail selling prices of 84 articles of food are secured 
biweekly from dealers in 51 representative cities throughout 
the United States. In weighting the quotations on foods 
of a single type (fresh vegetables, for example) in a given 
city, account is taken of the quantities of such foods con- 
sumed by an average wage-earner’s family in that city or, 
for some regions, in the district in which that city lies. In 
obtaining weights consumption by food groups is considered, 
rather than by specific commodities, since the commodities 
actually priced must be taken to represent similar commodi- 
ties for which no prices are collected. 

The combination for a single city (or geographical area) 
of food prices thus weighted yields an index for that region. 
The food cost index for the United States is computed from 
the aggregates for the 51 cities, each weighted according to 
the population of the area which the city is taken to repre- 
sent. Thus the weights entering the final index of retail 
food prices for the country as a whole represent quantities 
consumed by the average wage-earner’s family, and the 
population assumed to be affected by each series of quoted 
prices. The base of the index numbers, as published, is the 
average of the three-year period 1923-1925. 

The indices of retail food prices, together with index 
numbers of the prices of electricity and coal, at retail, are 
published in the Monthly Labor Review. 

The difficulties inherent in the problem of measuring 
wholesale price movements have been discussed at some 
length. The construction of index numbers of retail prices 
of the type just described presents even greater difficulties. 
All the theoretical problems arising in the former case are 
to be solved and, in addition, the practical difficulties of 
securing suitable weights, accurate price figures, and com- 
parable quotations are intensified. Because of the lack of com- 
modity standardization, and because of variations in business 
practice and local customs, the latter difficulty is particularly 
acute. For these reasons no index of retail prices at present 
published can be accepted with the confidence with which 
the best indices of wholesale prices may be received. 

INDEX NUMBERS OF THE COST OF LIVING 

If these problems are acute in constructing an index of 
retail prices they are doubly hard to solve in measuring 
such an entity as the cost of living. When food prices, 
rents, retail clothing prices, cost of fuel and light, retail 
furniture prices, and the prices of the other miscellaneous 
items which are included in the budget of the average family 
are to be averaged, and an index number constructed to 
measure variations in the cost of these items, numerous 
statistical difficulties must be overcome. Theoretical ques- 
tions concerning the most suitable methods of averaging 
and weighting present themselves, but more important are 
the practical problems involved in the collection of accurate 
and comprehensive prices and weighting data. 

Two index numbers of the cost of living are currently 
compiled in the United States, one by the Bureau of Labor 
Statistics, one by the National Industrial Conference Board 
of New York. The former appears in the Monthly Labor 
Review, the latter in periodic publications of the Conference 
Board. In each case the chief items of domestic expenditure 
are weighted in accordance with their relative importance in 
household budgets, and the combined results expressed as 
relative numbers. These are given on the 1913 and 1923-1925 
base by the Bureau of Labor Statistics, on the 1923 base 
by the Conference Board.¹ 

¹ For a general discussion of the problem, with details of the Conference 
Board procedure, see Cost of Living in the United States, 1914-1936, M. Ada 
Beney, New York, National Industrial Conference Board, 1936. 

INDEX NUMBERS OF PRICE AND BUYING POWER OF 
FARM PRODUCTS 

A set of useful index numbers relating to the prices re- 
ceived by and the prices paid by farmers is compiled by the 
United States Department of Agriculture. The first of these 
is based upon the prices at the farm, as of the middle of 
each month, of 34 major farm products and 13 commercial 
truck crops. The weights employed are the average quan- 
tities marketed by farmers during the period 1924-1929. 
Farmers and agricultural economists have need of such a 
specialized index, because the wholesale prices of farm prod- 
ucts in the great exchanges or in large cities are often poor 
representatives of the prices actually received by farmers. 

The index of prices paid by farmers is compiled quarterly 
(in March, June, September, and December). The constitu- 
ent quotations are retail prices paid by farmers for commod- 
ities used in family maintenance and in production. Weights 
are estimated quantities bought by farmers. The base of 
the index of farm prices, as published, is the average of the 
five pre-war years from August, 1909 to July, 1914; that of 
the index of prices paid, 1910-1914. Measurements for sub- 
groups are given, in both cases. 

These two index numbers are used in the derivation of an 
index of the purchasing power of farm products. The com- 
putation of the purchasing power index may be illustrated 
with reference to the figures for 1936. In that year the index 
of prices of farm products was 114. The index of prices paid 
by farmers was 124. That is, the farmer was receiving 
14 per cent more, on the average, for a unit of product than 
in 1909-1914, but the average price paid by him for a unit of 
goods purchased was 24 per cent higher than in the base 
period. Therefore the purchasing power of an average unit 
of farm products was 8 per cent less than in 1909-1914 
(114 ÷ 124 = .92). 

These three index numbers, for selected years, are given 
in Table 57. 

Table 57 

Index Numbers of Farm Prices, Prices Paid by Farmers, and the 
Buying Power of Farm Products ¹ 

   (1)             (2)               (3)                (4)
                                                  Average per unit
   Year       Prices received    Prices paid      purchasing power
                by farmers       by farmers       of farm products
                                                      (2) ÷ (3)

1910-1914         100*              100                100
1918              202               176                115
1920              211               201                105
1921              125               152                 82
1925              156               157                 99
1929              146               153                 95
1932               65               107                 61
1933               70               109                 64
1934               90               123                 73
1935              108               126                 86
1936              114               124                 92
1937              121               131                 93

* Aug., 1909-July, 1914 = 100. 
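
The derivation of column (4) from columns (2) and (3) may be sketched in Python as follows. The figures are taken from Table 57 for four of the years shown; the function name and the rounding to the nearest whole number are assumptions made for the illustration.

```python
def purchasing_power(prices_received, prices_paid):
    """Index of per-unit purchasing power: prices received as a percentage of prices paid."""
    return round(100 * prices_received / prices_paid)


# (prices received, prices paid) for selected years, from Table 57.
table_57 = {
    1918: (202, 176),
    1921: (125, 152),
    1932: (65, 107),
    1936: (114, 124),
}

ratios = {year: purchasing_power(received, paid)
          for year, (received, paid) in table_57.items()}
# ratios[1936] is 92, the figure worked out in the text (114 / 124 = .92).
```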

These are significant measurements, yielding valuable in- 
formation concerning the buying and selling relations that 
vitally affect one important group of producers. The devel- 
opment of similar measurements for other groups will add 
materially to our understanding of the changes that shifting 
market relations entail, in the economy at large. Yet the 
limitations of these index numbers should not be overlooked. 
The measurement of prices paid by farmers and, correspond- 
ingly, the measurement of the purchasing power of farm 
products, are subject to the difficulties referred to in the 
discussion of retail prices and living costs. Under existing 
conditions the margin of error in all such measurements is 
fairly wide. The error is the greater, too, the longer the 
time span covered by the quotations. In the present case, 
goods bought by farmers have undergone greater changes 
in quality than have the fairly standardized staples that 
the farmer sells. Here, as in all price comparisons over time, 
greater confidence must attach to short-period comparisons 
than to those spanning several decades. 

¹ Source: The Agricultural Situation, U. S. Bureau of Agricultural Economics. 


CHAPTER VII 


THE ANALYSIS OF TIME SERIES: MEASUREMENT 

OF TREND 

The preceding sections have dealt primarily with frequency 
series and with problems arising in the attempt to organize 
and describe such series. We are now concerned with 
data in the study of which the essential problem is the 
analysis of chronological variations. Such series are of 
major importance in the field of economic statistics, for 
most of the data of economics and business are variables in 
time — as bank clearings, steel production, volume of sales, 
etc. This dominating importance of series in time is not 
found in any other field of statistical research, and the 
development of methods of analysis appropriate to time 
series has come, accordingly, only within recent years with 
the wider adoption of statistical methods in the field of 
economics. 

Problems connected with time series arise both in the 
ordinary routine of internal administration and in the 
analysis of general economic conditions. Sales, purchases, 
profits on the one hand, stock prices, interest rates, business 
failures on the other, are variables which fluctuate with the 
passage of time. In the analysis of such series it is generally 
desired that the rate and character of growth be determined, 
and that periodic and accidental fluctuations be isolated 
for study. The sales manager wishes to know how the vol- 
ume of sales is faring, when and why it fluctuates and how 
it compares with volume of production. The economist 
desires to trace the trend of prices, and to scrutinize minutely 
the upward and downward movements of the price level. 
The making of business plans on even a small scale, as 
well as the most elaborate schemes of economic forecast- 
ing, must rest upon such study of past trends and fluctua- 
tions, and upon comparison of the movements of related 
series in time. Scientific study of the business cycle is only 
possible through the application of such methods. Our 
present task is the development of methods appropriate to 
the analysis of series in time. 

The Preliminary Organization of Time Series 

The data of time series usually require less preliminary 
organization than do statistical data which are to be reduced 
to the form of a frequency distribution. The source, pri- 
mary or secondary, from which the figures are taken usually 
presents them in shape for analysis. Certain precautions 
should be observed, however. 

The dates to which the figures apply should be clearly 
understood and definitely stated. Monthly data may be 
as of the first of each month (as for the stock price index 
of the New York Stock Exchange), averages for each month 
(as for the Bureau of Labor Statistics’ price index), or 
totals for each month (as in the case of figures on cotton 
consumption). They may be cumulative monthly figures, 
each item representing the total for the year to date, as 
in the case of certain coal production data. If average 
figures are given for a month or year it is important to 
know how the average has been secured. 
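
Where figures are published in cumulative form, the totals for the separate months may be recovered by differencing, as the following Python sketch suggests. The year-to-date figures are hypothetical and the function name is illustrative.

```python
def monthly_from_cumulative(year_to_date):
    """Recover monthly totals from cumulative (year-to-date) figures.

    The first month's total is the first cumulative figure; each later
    month is the difference between successive cumulative figures.
    """
    monthly = [year_to_date[0]]
    for previous, current in zip(year_to_date, year_to_date[1:]):
        monthly.append(current - previous)
    return monthly


# HYPOTHETICAL cumulative output, January through April.
cumulative = [40, 75, 115, 150]
monthly = monthly_from_cumulative(cumulative)
```

Plotting the cumulative figures as though they were monthly totals would show a spurious steady rise within each year, which is why the form of the data must be settled before analysis begins.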

Again, it is essential that in any time series there be 
strict comparability between data for different periods. 
Any attempt to analyze a series that is not homogeneous 
must be misleading and futile. Yet such series are not 
infrequently published. Commodity production or con- 
sumption figures published by trade associations and by 
governmental agencies are sometimes based upon returns 
from a varying number of reporting concerns. A series 
of price quotations may lack comparability as between 
different dates because of changes in the unit or grade to 
which the quotations apply, or because quotations are 
drawn from different markets. Changes in census classifica- 
tions may result in lack of comparability of census data. 
A change in a salesman’s territory may alter his returns 
materially. It is stated that the character of the obligations 
represented by the United States Steel Corporation’s figures 
for “unfilled orders” has varied from time to time. Records 
relating to the physical output of a given commodity in 
different periods may be rendered inaccurate by changes 
in quality or design. These are examples of faults that 
may be found in time series, rendering analysis futile. 
Strict testing is essential before a series be accepted as 
accurate and homogeneous. 

GRAPHIC REPRESENTATION OF TIME SERIES 

Normally the first step to be taken in visualizing a series 
in time and in preparing for further analysis consists of 
plotting the data. The trend and general characteristics 
of a series may be most readily apprehended through 
graphic representation. The data may be plotted on ordinary 
arithmetic or on semi-logarithmic paper. The advantages 
of the latter type for certain purposes have already been 
explained. The choice in a given case will depend upon the 
nature of the data and the object of the study. If interest 
lies in the absolute amount of fluctuations in sales, prices, 
pig iron production or whatever may be in process of 
analysis, or in the comparison of absolute differences between 
series, the ordinary rectilinear chart is to be employed. 
If percentage variations and the comparison of relative 
fluctuations are matters of interest, the semi-logarithmic 
representation is preferable. In general, if one is accustomed 
to the interpretation of this latter type of chart, its use is 
advisable. A clearer, less-distorted presentation of relations 
and a more significant comparison of series are generally 
secured when economic data having time as one variable 
are plotted on paper with a logarithmic ruling on one axis. 

For some purposes the process of studying series in time 
will have been completed when the data are thus plotted. 
The general trend may be roughly determined from the 
chart. The existence of seasonal and other periodic varia- 
tions may be ascertained. Rough comparisons of trends 
and fluctuations may be made. All the knowledge thus 
secured, it should be noted, will be non-quantitative in 
character, and the comparisons will be tentative and 
approximate. Even so, such charts enable trends and 
relations to be much more clearly visualized than do the 
raw figures, and for some purposes the knowledge thus 
secured is sufficient, though it lacks precision and accuracy. 
For other purposes more exact measurement and more 
refined analysis are required. Certain appropriate methods 
may be described. 
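
The distinction drawn earlier between arithmetic and semi-logarithmic ruling can be checked numerically. In the Python sketch below two hypothetical series each grow by 10 per cent per period; the series and the helper name are assumptions introduced only for the illustration.

```python
from math import log10


def steps(values):
    """Differences between successive items of a series."""
    return [b - a for a, b in zip(values, values[1:])]


# Two HYPOTHETICAL series, each rising 10 per cent per period.
series_a = [100, 110, 121]
series_b = [1000, 1100, 1210]

# On an arithmetic scale the vertical steps are the absolute differences.
arith_a = steps(series_a)
arith_b = steps(series_b)

# On a logarithmic scale the vertical steps are differences of logarithms.
log_a = steps([log10(v) for v in series_a])
log_b = steps([log10(v) for v in series_b])
```

The absolute steps differ tenfold between the two series and grow from period to period, while the logarithmic steps are the same throughout. Equal percentage changes thus appear as equal vertical distances on the ratio chart.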

Forces Affecting Series in Time 

The general object in the analysis of a time series is the 
isolation of the effects of one or more of the forces affecting 
the given series. This may be desired in order that the 
past behavior of the single series may be understood, in 
order that the future behavior of the series may be pre- 
dicted, or in order that two or more series may be compared. 
It is not in any case possible to isolate these effects of 
individual forces with absolute accuracy, and in some cases 
it is impossible even to approximate such a result. But 
given figures covering a sufficiently long period, the effects 
of various influences upon the behavior of a given series 
may usually be measured with some degree of accu- 
racy. 

What are these forces that affect series of data in time? 
The forces in any given case may be unique, affecting only 
the given series, but in general the various influences acting 
upon such series may be placed in a limited number of 
categories. 

SECULAR TREND 

In the first place, most series of economic statistics exhibit 
definite trends. Such a trend may be constant in direction, 
may change direction at a constant rate, or may even be 
characterized by abrupt shifts in direction or rate that 
reflect the introduction of novel elements. Thus the volume 
of production or sales of a business house over a period 
of years usually shows a fairly regular growth. The same 
is true of population, the production of basic minerals, 
the number of motor vehicles registered, etc. In some 
cases the rate of growth may be a negative one, as is true 
of interest rates in the United States over the last half 
century. The concept of secular trend (i.e., trend over a 
long period of time) covers both positive and negative 
changes of this type. 

In the analysis of a time series the trend value at any 
date is taken to be the “normal” value at that date. This 
conception of a normal value which may be used as a base 
or point of reference in judging the effects of all forces other 
than the growth factor, is fundamental in economic analy- 
sis. “No other method,” says Carl Snyder, “enables us so 
quickly to set economic events in their just perspective.” We 
should note, however, that such a normal value is essen- 
tially an empirical construction. While useful for purposes 
of reference, and as one of a series of measurements reflect- 
ing secular movements in a given series, it should not be 
assumed to possess any special normative significance. 

The fact should be emphasized that by secular trend is 
meant the smooth, regular, long-term movement of a statis- 
tical series. Frequent and sudden changes either in absolute 
amounts or in rates of increase or decrease are quite incon- 
sistent with the idea of secular trend. It is true that there 
may be occasional changes due to the interjection of a 
new element or the withdrawal of an old factor. But the 
breaking up into numerous sections of the period covered 
by a time series, and the determination of trend for each 
of these minor periods, does violence to the very concept 
of secular change. 

It does not follow from this discussion that a definite 
rising or declining trend exists for all time series. Many 
series, such as barometric readings at a certain point, 
merely fluctuate about a constant level that does not change 
with the passage of time. 

PERIODIC FLUCTUATIONS 

If the plotted representation of a time series be studied, 
the long-term trend may be discerned in the general upward 
or downward drift, but may not be precisely determined 
by inspection because of the existence of numerous fluctua- 
tions, superimposed upon the trend. These fluctuations 
may be regular or irregular, violent or mild, simple or 
complex. The value of the variable at any given date 
may be thought of as the net resultant of the interaction 
of the secular trend and the various forces that tend to 
modify the persistent secular movements of a given series. 
(It may be, in fact, that for many series the trend is the re- 
sultant of the interplay of a variety of conflicting forces, 
rather than an underlying movement upon which the peri- 
odic and other fluctuations are superimposed. In the present 
discussion no attempt is made to define the organic relations 
that may exist among the forces affecting a series in time.) 
These latter forces may be of several types. 

Seasonal variations are found in most series of economic 
statistics for which quarterly, monthly, or weekly values 
are obtainable. Consumption and production of commodi- 
ties, interest rates, bank clearings, railroad freight traffic, 
and many other types of data are marked by seasonal 
swings repeated with minor variations year after year. 
These, in so far as they exist at all, are definitely periodic 
in character, with a constant twelve-month period. Less 
markedly periodic, but nevertheless characterized by a 
considerable degree of regularity, are the cyclical fluctuations 
that are found in series affected by forces connected with 
economic or business cycles. Prices, wages, the volume 
of industrial production, trading on the Stock Exchange, 
and most series relating to the activities of individual 
business units are affected by the swings of business through 
alternating periods of depression and prosperity. While 
the length of such periods may vary, the general sequence 
of change has been in the past sufficiently regular to render 
these cyclical movements capable of study. 

RANDOM FLUCTUATIONS 

Entangled with these more or less regular movements 
are the effects of random, accidental, and irregular fluctua- 
tions — catastrophic events such as the San Francisco earth- 
quake, wars, floods, fires, and countless minor events equally 
fortuitous though less violent in the resulting disruptions. 
These events influence the value of a variable at a given 
date, modifying the effects of long-term movements and of 
seasonal and cyclical factors. The observed value at any 
time is the resultant of the play of all these forces. 
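
The notion of an observed value as the resultant of several forces may be made concrete with a synthetic series. The Python sketch below assumes, purely for illustration, that the forces combine by simple addition; the text does not commit itself to any particular mode of combination, and every magnitude and name in the sketch is hypothetical.

```python
import math
import random


def synthetic_series(n_months, seed=0):
    """Build a hypothetical monthly series as the sum of four components."""
    rng = random.Random(seed)
    rows = []
    for t in range(n_months):
        trend = 100.0 + 0.5 * t                          # steady secular growth
        seasonal = 5.0 * math.sin(2 * math.pi * t / 12)  # constant twelve-month period
        cyclical = 8.0 * math.sin(2 * math.pi * t / 48)  # an assumed four-year cycle
        irregular = rng.gauss(0.0, 1.0)                  # random disturbance
        rows.append({
            "trend": trend,
            "seasonal": seasonal,
            "cyclical": cyclical,
            "irregular": irregular,
            "observed": trend + seasonal + cyclical + irregular,
        })
    return rows


rows = synthetic_series(96)
```

In practice only the observed column is available, and the task of analysis is to work back from it toward the separate components.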

The analysis of series in time involves the isolation of 
the effects of these various forces, so far as this is possible. 
A problem may call for the study of but one factor, or it 
may require the complete breaking up of given values. 
When annual data are used the seasonal element will not 
enter, of course. The explanation of methods begins with 
a consideration of problems involving only this type of data. 

The Measurement of Secular Trend 

As an example of the type of material in connection with 
which these problems arise, the figures in Table 58 on page 
232 may be taken. The values are given in thousands of 
millions in order to simplify the calculations. 

Table 58 

New York Clearing House Transactions, 1875-1936 
(In thousands of millions) 

Year   Amount     Year   Amount     Year   Amount     Year   Amount

1875    25.1      1891    34.1      1907    96.3      1923   214.6
1876    21.6      1892    36.3      1908    73.6      1924   235.5
1877    23.3      1893    34.4      1909    99.3      1925   276.9
1878    22.5      1894    24.2      1910   102.6      1926   293.4
1879    25.2      1895    28.3      1911    92.4      1927   307.2
1880    37.2      1896    29.4      1912    96.7      1928   368.9
1881    48.6      1897    31.3      1913    98.1      1929   456.9
1882    46.6      1898    39.9      1914    89.8      1930   399.5
1883    40.3      1899    67.4      1915    90.8      1931   287.7
1884    34.1      1900    62.0      1916   147.2      1932   177.3
1885    25.3      1901    77.0      1917   181.5      1933   154.6
1886    33.4      1902    74.8      1918   174.5      1934   162.7
1887    34.9      1903    70.8      1919   214.7      1935   174.4
1888    30.9      1904    69.7      1920   252.3      1936   186.5
1889    34.8      1905    91.9      1921   204.1
1890    37.7      1906   103.8      1922   213.3

As has been pointed out, the figure for any year, as the 
value of $162.7 thousands of millions for 1934, is the net 
resultant of the many forces that we have classified under 
the headings of secular trend, cyclical variations, and ran- 
dom or accidental fluctuations. Our first problem is to 
measure the secular trend. 

In Fig. 54 the data of New York bank clearings during 
the period 1875-1936, inclusive, have been plotted. A 
definite trend is apparent, together with well marked and 
more or less regular deviations from that trend. Several 
methods are available for arriving at approximations to 
this trend. By employing moving averages an attempt may 
be made to eliminate passing fluctuations and to arrive 
at values that define the influence of the steadily operating 
growth factor. If we assume that a definite functional 
relationship prevails (empirically at least) between the time 
factor and the other variable, an approximation to the 
trend may be secured by fitting an appropriate curve to 
the plotted data. Smoothing the data by hand gives some- 
what the same result, the curve being frankly approxi- 
mative and empirical in character. In certain studies it 
has been found possible to use one statistical series as base 
or trend line for another series of homogeneous data. 


Moving Averages 

When a trend is to be determined by the method of 
moving averages, the average value for a number of years 
(or months, or weeks) is secured, and this average is taken 
as the normal or trend value for the unit of time falling 
at the middle of the period covered in the calculation of 
the average. Table 59 shows the results secured when three-, 
five-, seven-, and nine-year moving averages are thus 
computed for New York clearings for the period 1912-1936. 


Table 59 

New York Clearing House Transactions, 1912-1936, and 3-, 5-, 7-, 
and 9-Year Moving Averages 
(In thousands of millions) 

Year    Original    Three-year    Five-year     Seven-year    Nine-year
        data        moving av.    moving av.    moving av.    moving av.

1912    $ 96.7
1913      98.1      $ 94.87
1914      89.8        92.90       $104.52
1915      90.8       109.27        121.48       $125.51
1916     147.2       139.83        136.76        142.37       $149.51
1917     181.5       167.73        161.74        164.40        161.44
1918     174.5       190.23        194.04        180.73        174.24
1919     214.7       213.83        205.42        198.23        188.11
1920     252.3       223.70        211.78        207.86        204.19
1921     204.1       223.23        219.80        215.57        218.60
1922     213.3       210.67        223.96        230.20        231.03
1923     214.6       221.13        228.88        241.44        245.78
1924     235.5       242.33        246.74        249.29        262.91
1925     276.9       268.60        265.52        272.83        285.64
1926     293.4       292.50        296.38        307.63        307.36
1927     307.2       323.17        340.66        334.04        315.62
1928     368.9       377.67        365.18        341.50        311.48
1929     456.9       408.43        364.04        327.27        302.49
1930     399.5       381.37        338.06        307.44        289.80
1931     287.7       288.17        295.20        286.80        276.58
1932     177.3       206.53        236.36        259.01        263.17
1933     154.6       164.87        191.34        220.39
1934     162.7       163.90        171.10
1935     174.4       174.53
1936     186.5

The three-year moving average for 1916 is the average 
of the figures for 1915-16-17, the five-year figure for 1916 
is the average of the years 1914-15-16-17-18. The other 
averages are computed in the same way. In each case the 
average is centered for the period included; that is, the 
average is taken to represent normal as of the middle 
of the given period. The employment of an odd number 
of years simplifies this centering process, though it is not 
essential that the number be odd. With an even number 
of years, the figure may be centered by taking a two-year 
moving average of the first moving average. The three- 
and nine-year moving averages for the entire period are 
plotted, with the original data, in Fig. 54. 
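
The centering rule just described may be sketched in Python. This is an illustration only, not part of the original text: the function names moving_average and centered_moving_average are assumptions introduced here, and the data are the first nine clearings figures of Table 59.

```python
def moving_average(values, period):
    """Averages of each run of `period` consecutive items."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]


def centered_moving_average(values, period):
    """Moving average placed at the middle item; None where undefined."""
    averages = moving_average(values, period)
    if period % 2 == 0:
        # An even period falls between two items, so a two-item
        # moving average of the first averages centers it.
        averages = moving_average(averages, 2)
    centered = [None] * len(values)
    for offset, average in enumerate(averages):
        centered[period // 2 + offset] = average
    return centered


# Clearings for 1912-1920 (Table 59, thousands of millions of dollars).
clearings = [96.7, 98.1, 89.8, 90.8, 147.2, 181.5, 174.5, 214.7, 252.3]
three_year = centered_moving_average(clearings, 3)
five_year = centered_moving_average(clearings, 5)

# Index 4 is the year 1916; compare with the 1916 row of Table 59.
print(round(three_year[4], 2), round(five_year[4], 2))
```

With an even period the same function applies the two-item centering described above, so that an eight-item average of twenty items begins at the fifth item.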

It is obvious that the effect of the averaging is to give a 
smoother curve, lessening the influence of the fluctuations 
that pull the annual figures away from the general trend. 
The longer the period included in securing each average, 
the smoother is the curve secured, though there are other 
factors to consider in deciding upon the length of the period. 
Certain of these factors may be noted. 

CHARACTERISTICS OF MOVING AVERAGES

Given cyclical fluctuations about a uniform level or about 
a line ascending with a uniform slope, the length of the 
cycle and the magnitude of the fluctuations being constant, 
a moving average having a period equal to the period of 
the cycle (or to a multiple of that period) will give a straight 
line, a perfect representation of the trend. Under the 
same conditions a moving average having a period greater 
or less than the period of the cycle will give, not a straight
line, but a new cycle having the same period as the original,
but with fluctuations of less magnitude. The minima and
maxima of the cycles thus obtained will not necessarily
coincide with the minima and maxima of the original cycles.
In general, when such a new cycle is obtained the magnitude
of the fluctuations will be less the longer the period on
which the average is based.¹

¹ The decrease in the magnitude of the fluctuations is not regular, however,
but cyclical.

These propositions may be illustrated by the figures in
Table 60, arbitrarily chosen. In the first example five
figures have been selected which repeat themselves in
sequence, fluctuating about a common level.

Table 60

Illustrating the Application of Moving Averages

(1)       (2)         (3)         (4)         (5)
Cyclical  Moving      Moving      Moving      Moving
data      average of  average of  average of  average of
          5 items     10 items    3 items     8 items
                      (centered)              (centered)

2
6                                 5 1/3
8         6 1/5                   8
10        6 1/5                   7 2/3
5         6 1/5                   5 2/3       6 3/8
2         6 1/5       6 1/5       4 1/3       6 13/16
6         6 1/5       6 1/5       5 1/3       6 3/8
8         6 1/5       6 1/5       8           5 3/4
10        6 1/5       6 1/5       7 2/3       5 11/16
5         6 1/5       6 1/5       5 2/3       6 3/8
2         6 1/5       6 1/5       4 1/3       6 13/16
6         6 1/5       6 1/5       5 1/3       6 3/8
8         6 1/5       6 1/5       8           5 3/4
10        6 1/5       6 1/5       7 2/3       5 11/16
5         6 1/5       6 1/5       5 2/3       6 3/8
2         6 1/5                   4 1/3       6 13/16
6         6 1/5                   5 1/3
8         6 1/5                   8
10                                7 2/3
5

(The items in columns (3) and (5) have been centered by means of a
moving average of 2 items.)

The moving averages in columns (2) and (3) represent
the data with the cycles completely removed. When the
period of the average is not equal to the period of the cycle,
or to a multiple of that period, the cycle is not removed,
as is apparent from the figures in columns (4) and (5).
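
The behavior shown in Table 60 can be checked with a short Python sketch. It is offered as an illustration under stated assumptions: the helper name moving_average is introduced here, and exact fractions are used so that the constancy of the five-item average is not obscured by rounding.

```python
from fractions import Fraction


def moving_average(values, period):
    """Exact averages of each run of `period` consecutive items."""
    return [Fraction(sum(values[i:i + period]), period)
            for i in range(len(values) - period + 1)]


cycle = [2, 6, 8, 10, 5]      # the five repeating figures of Table 60
series = cycle * 4            # twenty items

five_item = moving_average(series, 5)
three_item = moving_average(series, 3)

# Period equal to the cycle: every average is the same value, 31/5.
print(set(five_item))
# Shorter period: a cycle remains, with a smaller swing than the data.
print(max(three_item) - min(three_item), max(series) - min(series))
```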

The conclusions suggested above hold when the cyclical 
fluctuations take place about any straight line. In Table 61 
the foregoing data have been employed but with a constant 
increment of 3. This is equivalent to superimposing the 
same cycles upon a line with a slope of + 3.

Table 61

Illustrating the Application of Moving Averages to a Series with
Linear Trend

(1)       (2)         (3)         (4)         (5)
Cyclical  Moving      Moving      Moving      Moving
data      average of  average of  average of  average of
          5 items     10 items    3 items     8 items
                      (centered)              (centered)

2
9                                 8 1/3
14        12 1/5                  14
19        15 1/5                  16 2/3
17        18 1/5                  17 2/3      18 3/8
17        21 1/5      21 1/5      19 1/3      21 13/16
24        24 1/5      24 1/5      23 1/3      24 3/8
29        27 1/5      27 1/5      29          26 3/4
34        30 1/5      30 1/5      31 2/3      29 11/16
32        33 1/5      33 1/5      32 2/3      33 3/8
32        36 1/5      36 1/5      34 1/3      36 13/16
39        39 1/5      39 1/5      38 1/3      39 3/8
44        42 1/5      42 1/5      44          41 3/4
49        45 1/5      45 1/5      46 2/3      44 11/16
47        48 1/5      48 1/5      47 2/3      48 3/8
47        51 1/5                  49 1/3      51 13/16
54        54 1/5                  53 1/3
59        57 1/5                  59
64                                61 2/3
62

(The items in columns (3) and (5) have been centered by means of a
moving average of 2 items.)


The trend values, with the effect of the cycles completely
removed, are secured by taking moving averages equal in
period to the cycle or to a multiple of that period. The
cycle persists, with the same period but with diminished
amplitude, when the average is based upon a period not
equal to that of the cycle, as is clear from the figures in
columns (4) and (5).
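
The same check may be repeated for the series of Table 61. The sketch below is illustrative only; it assumes that the data are the cyclical figures of Table 60 with 3 added for each step after the first, and the names used are introduced here.

```python
from fractions import Fraction


def moving_average(values, period):
    """Exact averages of each run of `period` consecutive items."""
    return [Fraction(sum(values[i:i + period]), period)
            for i in range(len(values) - period + 1)]


cycle = [2, 6, 8, 10, 5]
# Item k (counting from 0) is the cyclical figure plus 3k.
series = [value + 3 * k for k, value in enumerate(cycle * 4)]

# Period equal to the cycle: the averages rise by exactly 3 per item,
# that is, they lie on the straight line of trend.
five_item = moving_average(series, 5)
steps = {later - earlier for earlier, later in zip(five_item, five_item[1:])}
print(five_item[0], steps)

# Period of 3: deviations from the line 31/5 + 3k repeat every 5 items.
three_item = moving_average(series, 3)
deviations = [average - (Fraction(31, 5) + 3 * (k + 1))
              for k, average in enumerate(three_item)]
print(deviations[:5])
```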

When these ideally simple conditions of constant period 
and amplitude do not exist, the moving average becomes 
more ambiguous and its interpretation less simple. If the 
period of the cycle varies, the selection of a period for the 
moving average is more difficult. In general, a period
equal to or greater than the average length of the cycle 
is to be selected. An average having a shorter period will 
give a line that is marked by pronounced cycles, these 
cycles being reduced as the period covered in the calculation 
of the average increases. 

When the amplitude of the cycle varies, the period being 
constant, a moving average with a period equal to the 
length of the cycle will give a line of trend marked by 
minor cycles. The amplitude of these secondary cycles 
will be a minimum when the period of the average is equal
to the period of the cycle (or to a multiple of that period). 
When these last two irregularities are combined, and the 
data are characterized by cycles of varying amplitude and 
of varying length, the moving average giving the most 
effective representation of the trend is that which has a 
period equal to the average length of the cycle, or to a 
multiple of that length. 

A new factor enters when the trend departs from line- 
arity. If the underlying trend of a series is concave upward, 
a moving average will always exceed the actual trend value; 
if the reverse is true, and the trend is convex upward, a 
moving average will always be less than the actual trend 
value. 

These conditions are depicted in the following examples.
The figures in Table 62 give the values secured when a
cycle of constant period and amplitude, as in col. (3), is
superimposed upon a line of trend that is concave upward,
i.e., increasing at a constantly increasing rate. If the mov-
ing average could completely eliminate the effects of the
cycle, the values secured from the average would be equal to
the average value of the five items in each cycle (6.2) plus
the values of the function y = x², given in col. (2).

Table 62

Illustrating the Application of Moving Averages to a
Non-Linear Series
(Increasing rate)

(1)   (2)     (3)         (4)             (5)               (6)
x     x²      Cyclical    Col. (2) plus   Moving average    True trend values
              data        col. (3)        of 5 items        (x² + 6.2)

0     0       2           2
1     1       6           7
2     4       8           12              12.2              10.2
3     9       10          19              17.2              15.2
4     16      5           21              24.2              22.2
5     25      2           27              33.2              31.2
6     36      6           42              44.2              42.2
7     49      8           57              57.2              55.2
8     64      10          74              72.2              70.2
9     81      5           86              89.2              87.2
10    100     2           102             108.2             106.2
11    121     6           127             129.2             127.2
12    144     8           152             152.2             150.2
13    169     10          179             177.2             175.2
14    196     5           201             204.2             202.2
15    225     2           227             233.2             231.2
16    256     6           262             264.2             262.2
17    289     8           297             297.2             295.2
18    324     10          334
19    361     5           366

The values of the moving average are, in this case, in 
excess of the true trend values, a form of distortion that 
will always occur with a series of this type. 

In Table 63 are shown the results of superimposing the
same cyclical values upon a line of trend that is convex
upward, i.e., increasing at a constantly decreasing rate.
In this case, a perfect method of eliminating the cycles
would give results equal to the average value of the five
items (6.2) plus the values of the function y = √x.

In this case the moving average values are consistently 
too low. The discrepancy is greatest for the lower values 
of X, as the decrease in the rate of growth is most marked 
for these values. 


Table 63

Illustrating the Application of Moving Averages to a
Non-Linear Series
(Decreasing rate)

(1)   (2)     (3)         (4)             (5)               (6)
x     √x      Cyclical    Col. (2) plus   Moving average    True trend values
              data        col. (3)        of 5 items        (√x + 6.2)

0     0       2           2.00
1     1.00    6           7.00
2     1.41    8           9.41            7.428             7.61
3     1.73    10          11.73           7.876             7.93
4     2.00    5           7.00            8.166             8.20
5     2.24    2           4.24            8.414             8.44
6     2.45    6           8.45            8.634             8.65
7     2.65    8           10.65           8.834             8.85
8     2.83    10          12.83           9.018             9.03
9     3.00    5           8.00            9.192             9.20
10    3.16    2           5.16            9.354             9.36
11    3.32    6           9.32            9.510             9.52
12    3.46    8           11.46           9.658             9.66
13    3.61    10          13.61           9.800             9.81
14    3.74    5           8.74            9.936             9.94
15    3.87    2           5.87            10.068            10.07
16    4.00    6           10.00           10.194            10.20
17    4.12    8           12.12           10.318            10.32
18    4.24    10          14.24
19    4.36    5           9.36
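
The distortions illustrated in Tables 62 and 63 may be reproduced with the following Python sketch. It is an illustration under stated assumptions: the function name moving_average_errors and the constants are introduced here, and ordinary floating-point arithmetic is used, so the results are approximate in the last decimal places.

```python
import math

CYCLE = [2, 6, 8, 10, 5]
CYCLE_MEAN = sum(CYCLE) / len(CYCLE)   # 6.2


def moving_average_errors(trend, count=20, period=5):
    """Moving average minus true trend (trend + 6.2) at each middle item."""
    series = [trend(x) + CYCLE[x % len(CYCLE)] for x in range(count)]
    half = period // 2
    errors = []
    for x in range(half, count - half):
        average = sum(series[x - half:x + half + 1]) / period
        errors.append(average - (trend(x) + CYCLE_MEAN))
    return errors


concave_up = moving_average_errors(lambda x: x ** 2)   # trend of Table 62
convex_up = moving_average_errors(math.sqrt)           # trend of Table 63

# Concave upward: the average is too high throughout.
print(min(concave_up), max(concave_up))
# Convex upward: the average is too low, most of all for small x.
print(min(convex_up), max(convex_up))
```

For the trend y = x² the excess is the same at every point, since the mean of the five squares centered on x exceeds x² by 2; this agrees with the constant difference of 2.0 between columns (5) and (6) of Table 62.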




Considerations previously reviewed have indicated that
a moving average should, in general, be based upon a
period at least equal to the period of the cycle, and prefer-
ably equal to some higher multiple of that period when the
data are at all irregular. The longer the period covered,
the greater the stability of the average. But when the
underlying trend departs materially from the linear form,
following a curve bending upward or downward, the error
involved in the use of any moving average increases as
the period of the average increases. If a moving average
is used in such a case to measure the trend, the period of
the average should be the shortest which will serve to
average out the cycles; equal, that is, to the average length
of one cycle.

In practice, however, these various conditions are found
in complicated combinations. The fact that cycles vary in
amplitude and length calls for a moving average based
upon a fairly long period. The fact that the trend of the
data is usually non-linear calls for a short period average
to lessen the upward or downward distortion. A considera-
tion of some importance in practical work is that a moving
average can never be brought up to date. The lag is less,
of course, the shorter the period covered by the average.
The selection of a period in a given case must rest upon a
study of the actual data with these various considerations
in mind.

It has been assumed in the preceding discussion that the 
purpose of the moving average is the representation of 
secular trend. The moving average may be used, also, in 
smoothing data for the purpose of eliminating random
fluctuations. For this purpose a moving average based 
upon a period shorter than the average length of the cycle 
should be selected. 

We may return now to the problem relating to New York 
bank clearings. A study of the lines marked out by the 
different moving averages in Fig. 54 reveals significant 
differences between them. The three-year average follows 
the graph of the original data most closely, as would be 
expected. The nine-year average marks out the smoothest 
line of trend, but, on the other hand, departs most widely
from the data. This is particularly noticeable from 1893 to
1898, from 1911 to 1916, from 1921 to 1926, and from 1927 
to 1931. It is due to the pronounced changes in the rate 
of growth of the series during these periods. Except for 
these distortions the general trend seems to be most accu- 
rately represented by the nine-year average. 

In determining the relative merits of the different moving 
averages we are aided by a knowledge of the course of 
business during the period covered. The volume of New 
York bank clearings is a sensitive index of general business 
conditions, responding immediately to changes in specula- 
tive and industrial activity. Major and minor business 
cycles are reflected in this series. Knowing the number of
cycles through which business has passed during the period
1875-1936, we may determine which of the moving averages
serves best as a standard from which to measure cyclical
deviations. In this case we are practically working back- 
ward from a known result, a method not always available. 

If we take as a starting point in each cycle the year in
which revival began, after recession, the following cycles
in general business activity may be distinguished:¹

1871-1879        1908-1912
1879-1885        1912-1914
1885-1888        1915-1919
1888-1891        1919-1921
1891-1894        1921-1924
1894-1897        1924-1927
1897-1900        1927-1933
1901-1904        1933-
1904-1908

¹ These dates are based upon the chronology of American business cycles
developed by Wesley C. Mitchell; cf. "Production during the American
Business Cycle of 1927-1933," by Wesley C. Mitchell and Arthur F. Burns,
Bulletin 61, National Bureau of Economic Research, November 9, 1936. It should
be noted that the chronology is based upon monthly data, whereas the Clearing
House data cited in the text are annual figures.

The cycles marked out by the three-year moving average
are too numerous to enumerate. In fact, the deviations
from this average are primarily accidental and minor
fluctuations and should not be classed as cycles. Deviations
from the five-, seven-, and nine-year averages mark out the
following cycles:
Cycles of deviations    Cycles of deviations    Cycles of deviations
from five-year          from seven-year         from nine-year
moving averages         moving averages         moving averages

1879-1885               1879-1886               1879-1885
1886-1888               1885-1888               1885-1888
1888-1891               1888-1894               1888-1897
1891-1897               1894-1900               1897-1900
1897-1900               1900-1904               1900-1904
1900-1904               1904-1908               1904-1908
1904-1908               1908-1915               1908-1911
1908-1911               1915-1923               1911-1916
1911-1915               1923-                   1915-1918
1915-1918                                       1918-1924
1918-1923                                       1924-1927
1923-1927                                       1927-1932
1927-1932                                       1932-
1932-


Some of the differences between the series of cycles thus 
determined and the reference cycles distinguished by 
Mitchell are doubtless due to the distinctive behavior of 
New York clearings. Other differences reflect the peculiari- 
ties of moving averages. Deviations from the five-year 
averages between 1879 and 1927 show one more cycle than 
we find in the series based on seven-year averages, four 
more cycles than are shown by the nine-year averages. 
And yet the deviations from five-year averages fail to show 
the cycles of 1894-1897 and of 1921-1924. The nine-year
averages reveal only eight cycles between 1879 and 1927, 
as against Mitchell’s fourteen reference cycles. Mitchell 
was working, of course, with monthly data which are 
more sensitive than annual data to cyclical forces. More- 
over, he was dealing with relatively short movements, some 
of which appear as only minor fluctuations in general business 
activity. 

If interest attaches to the shorter swings of business,
to cycles with average durations of three or four years,
a moving average of relatively short period should be used.
A five-year average is appropriate. Averages of longer
period define trend movements more faithfully, but may
fail to reveal fluctuations properly classified as business
cycles. We should refer, however, to recent attempts to 
establish the reality of long cycles, of nine, eleven, or as 
many as thirty years in average duration. In the study 
of such cycles moving averages of corresponding periods 
would be employed. 

In general, the moving average has the prime advantage 
of flexibility. The representation of secular trend by mathe- 
matical curves frequently involves the breaking up of a 
period into two or three subdivisions, and the fitting of 
separate curves to each. This results from changing condi- 
tions and sharply changing rates of growth or decline. 
Where such changes occur the moving average has the
merit of flexible adaptation to the new conditions and is
often a more effective measure of secular trend than curves
fitted with great labor.

Simple and weighted moving averages, in varying com-
binations, have wide uses in the analysis of economic time
series. An illuminating discussion of these uses, and of the
procedures appropriate to different purposes, is to be found
in The Smoothing of Time Series, by Frederick R. Macaulay.¹

¹ National Bureau of Economic Research, New York, 1931.

Representation of Secular Trend by Mathematical Curves

For many types of data the secular trend may be repre-
sented by a mathematical curve rather than by a line
based upon a moving average. Thus, if the growth (or
decline) is by constant absolute increments (or decrements)
a straight line will serve as an exact representation of the
trend. Or the growth may be by constant percentages,
as in the case of capital increase, when a principal sum
increases in accordance with the compound interest law.
A curve of a definite mathematical form furnishes the best 
representation of this trend. In many series of economic 
statistics the data seem to conform to definite laws of 
growth, or decline, and where this is the case the task of 
analysis, interpretation, and projection is materially assisted 
by securing a mathematical expression for the underlying 
law. In practically all cases, of course, there are departures 
from this law, deviations above and below the line of secular 
trend. These deviations, however, do not destroy the value 
of an equation that describes the underlying law of develop- 
ment. 

There is one fundamental difference between the moving
average as a measure of trend and such mathematical
curves. The former implies no definite "law" to which
the data are assumed to conform. It is based upon the
data as given; if the general trend changes, the moving
average follows the new trend. It is a flexible measure of
trend, adapting itself to changing conditions, purporting
to be nothing more than an empirical approximation to
the drift of the series. Mathematical curves fitted to eco-
nomic series are, in fact, nothing more than empirical
approximations also, but in a somewhat different sense. 
They assume a “law” of change underlying the variations, 
accidental and otherwise, which show upon the surface of 
the data. It is an empirical law which is assumed, it is 
true, but nevertheless there is postulated a uniform and 
consistent trend capable of mathematical expression. If 
such an assumption is to have any validity it is essential 
that the period during which the law is supposed to hold 
be homogeneous, that there be no material changes in 
the conditions affecting the series being studied. Thus 
an equation is secured for the trend of gold production, 
let us say. If a radical change should take place in methods 
of extraction the trend of gold production would change
materially and the former equation would no longer apply.
Data covering the period before and after such a change
would not be homogeneous, and a single equation for the
trend during the whole period should not be secured.

In the practical approach to a problem involving the 
determination of secular trend the first task is the selection 
of the appropriate type of curve. This is perhaps the most 
difficult part of the work; certainly it is the part in which 
the element of personal judgment enters most directly. 
For there is no objective rule to follow, no fixed standard 
by which the most appropriate curve may be selected. 
Something more will be said on this subject after the 
characteristics of the chief types of curves and the methods 
of fitting them have been described. For the present it 
may be assumed that a curve similar to one of the types 
described in Chapter II, or to a related form, has been 
selected, and that we face the practical task of fitting it to 
the data. 

FITTING A STRAIGHT LINE; THE METHOD OF LEAST SQUARES

If the data, when plotted, show a trend that can best 
be represented by a straight line the task of fitting is 
merely the determination of the constants in an equation 
of the form y = a + bx. The values of a and b which
will give a line following most closely the trend of the 
data are to be obtained. A simple illustration may serve 
to demonstrate the various methods which may be employed. 
Nine points (1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12; 
9, 11) are plotted in Fig. 55. Our problem is the fitting of 
a straight line to these points. 

By inspection approximate values of a and b may be
determined. A thread may be stretched through the 
points in such a direction that it seems to follow the trend 
as closely as possible. The slope of the line thus laid out 
may be measured, the y-intercept determined, and the 
desired equation thus approximated. Obviously this is a 
loose and uncertain method, and the results obtained by 
different individuals may be expected to vary rather widely.
There is one and only one straight line that fits the plotted
data most accurately. The constants for this line of best 
fit may be determined by the method of least squares. 

Fig. 55. Illustrating the Fitting of a Straight Line to Nine Points

The theory upon which the method of least squares is
based need not be detailed at length here. The argument
may be briefly presented: A number of observation values
of a certain quantity are found, and it is desired to obtain
the most probable value of the quantity which is being
measured. It is capable of demonstration that the most
probable value of the quantity is that value for which the
sum of the squares of the residuals is a minimum. (The
"residual" is a term for the difference between a given
estimated value and an observation value.) This is true
of the arithmetic mean of the observation values. Thus,
if a given distance be measured by a number of individuals,
with varying results, the most probable value is the arith-
metic mean of the different measurements. The process
of computing the mean involves the following steps, which
are enumerated for the purpose of simplifying the later
explanation. We seek a result, a statement of the most
probable value of the distance being measured, which will
take the form:

M = (a constant). 

Let us say we have three approximations to this value : 

         M =  5,672 feet
         M =  5,671 feet
         M =  5,676 feet
adding, 3M = 17,019 feet.

Since there is but one unknown, M, it may be derived 
directly from this equation, and we have 

M= 5,673 feet. 

This is the value for which the sum of the squares of the 
deviations is a minimum. 
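
The claim that the arithmetic mean gives the smallest sum of squared deviations can be tried on the three measurements above. The following Python lines are a minimal sketch; the function name sum_of_squares is an assumption introduced here.

```python
measurements = [5672, 5671, 5676]   # feet, as in the text


def sum_of_squares(estimate):
    """Sum of squared deviations of the measurements from `estimate`."""
    return sum((value - estimate) ** 2 for value in measurements)


mean = sum(measurements) / len(measurements)

# Compare the mean with neighboring trial values one foot either side.
for trial in (mean - 1, mean, mean + 1):
    print(trial, sum_of_squares(trial))
```

Any trial value other than the mean gives a larger total, which is the property carried over to the fitting of a line.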

A similar problem arises when the relation between two
variables is being measured. Our goal in this case is the
equation that correctly describes this relationship. We
have secured, however, varying results which do not agree
precisely as to the constants in the equation of relationship.
In other words, our plotted points do not all lie on the
same line. What are the most probable values of the con-
stants in the required equation? The answer is analogous
to that given when a single quantity was being measured.
We seek the constants which, when the resulting equation
is plotted, will give a line from which the deviations of
the separate points, when squared and totaled, will be a
minimum. Assuming that each pair of measurements gives
an approximation to the true relationship between the
variables, we wish to find the most probable relationship,
and this is given by the line for which the sum of the squared
deviations is a minimum.¹

¹ Cf. Appendix A for a more detailed discussion of the method of least squares,
together with a description of certain checks upon the calculations.

We have, in the present example, nine pairs of values for
x and y. Substituting these values in the generalized form
of the linear equation, y = a + bx, we secure the following
observation equations:

 3 = a + 1b
 4 = a + 2b
 6 = a + 3b
 5 = a + 4b
10 = a + 5b
 9 = a + 6b
10 = a + 7b
12 = a + 8b
11 = a + 9b.

Any two of these equations could be solved as simultaneous
equations, and values of a and b secured. But these values
would not satisfy the remaining equations. Our problem
is to combine the nine observation equations so as to secure
two normal equations, which, when solved simultaneously,
will give the most probable values of a and b. The first
of these normal equations is secured by multiplying each of
the observation equations by the coefficient of the first
unknown (a) in that equation, and adding the equations
obtained in this way. Since the coefficient of a in the present
case is 1 throughout, the nine observation equations are
unchanged by the process of multiplication. The second
of the normal equations is secured by multiplying each
of the observation equations by the coefficient of the second
unknown (b) in that equation, and adding the equations
obtained. Thus the first equation is multiplied throughout
by 1, the second by 2, and so on. The process of securing
the two normal equations is illustrated in Table 64.

Table 64

Derivation of Normal Equations from Observation Equations

 3 = a + 1b           3 =  1a +   1b
 4 = a + 2b           8 =  2a +   4b
 6 = a + 3b          18 =  3a +   9b
 5 = a + 4b          20 =  4a +  16b
10 = a + 5b          50 =  5a +  25b
 9 = a + 6b          54 =  6a +  36b
10 = a + 7b          70 =  7a +  49b
12 = a + 8b          96 =  8a +  64b
11 = a + 9b          99 =  9a +  81b

70 = 9a + 45b       418 = 45a + 285b

The two normal equations are

 70 =  9a +  45b
418 = 45a + 285b.

It remains to solve these equations for a and b. By multi-
plying the first equation by 5 and subtracting it from the
second, a may be eliminated; a value of 17/15, or 1.133, is
found for b. Substituting this value in either of the equa-
tions, a value of 2.111 is secured for a. The equation to
the best fitting straight line is, therefore,

y = 2.111 + 1.133x.
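
The formation and solution of the two normal equations may be sketched in Python for the nine points of the example. This is an illustration, not the author's procedure as printed: the function name fit_line is introduced here, and exact fractions are used so that the constants appear as 19/9 and 17/15 before rounding.

```python
from fractions import Fraction

points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]


def fit_line(points):
    """Solve the normal equations for a and b in y = a + bx."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    # Normal equations:  sum_y  = n*a     + b*sum_x
    #                    sum_xy = a*sum_x + b*sum_xx
    # Eliminating a gives b; substituting back gives a.
    b = Fraction(n * sum_xy - sum_x * sum_y, n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_line(points)
print(a, b)                                   # exact constants
print(round(float(a), 3), round(float(b), 3))  # rounded as in the text
```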

In the actual application of the method it is not necessary
to write out and total the equations, as is done above.
We need only insert the proper values in the two equations,¹

Σ(y) = na + bΣ(x)

Σ(xy) = aΣ(x) + bΣ(x²).

The symbols employed have the following meanings:

Σ(y): the sum of the values of y.

Σ(x): the sum of the values of x.

Σ(xy): the sum of the products of the paired x's and y's.

Σ(x²): the sum of the squares of the values of x.

n: the number of pairs of values; the number of points
plotted.

¹ General rules for the formation of normal equations are given in Ap-
pendix A.

The work of computation is facilitated by a tabular
arrangement similar to that shown in Table 65.

Table 65

Computation of Values Required in Fitting a Straight Line

x       y       xy      x²

1       3       3       1
2       4       8       4
3       6       18      9
4       5       20      16
5       10      50      25
6       9       54      36
7       10      70      49
8       12      96      64
9       11      99      81

45      70      418     285

n = 9
Σ(x) = 45
Σ(y) = 70
Σ(x²) = 285
Σ(xy) = 418

The two desired normal equations are secured by sub-
stituting these five values in the type equations given
above. It will be noted that the results are identical with
those obtained from the observation equations.

When the equation to the best fitting straight line has
been obtained the values of y corresponding to given values
of x may be computed and compared with the observed
values. Table 66 presents the results secured:

Table 66

Comparison of Observed and Computed Values of a Variable Quantity¹

x     y            y            d          d²         xd
      (observed)   (computed)

1     3            3.2⁴⁄₉       - .2⁴⁄₉     .0597     - .2⁴⁄₉
2     4            4.3⁷⁄₉       - .3⁷⁄₉     .1427     - .7⁵⁄₉
3     6            5.5¹⁄₉       + .4⁸⁄₉     .2390     + 1.4⁶⁄₉
4     5            6.6⁴⁄₉       - 1.6⁴⁄₉   2.7041     - 6.5⁷⁄₉
5     10           7.7⁷⁄₉       + 2.2²⁄₉   4.9381     + 11.1¹⁄₉
6     9            8.9¹⁄₉       + .0⁸⁄₉     .0079     + .5³⁄₉
7     10           10.0⁴⁄₉      - .0⁴⁄₉     .0020     - .3¹⁄₉
8     12           11.1⁷⁄₉      + .8²⁄₉     .6760     + 6.5⁷⁄₉
9     11           12.3¹⁄₉      - 1.3¹⁄₉   1.7190     - 11.8

                                0.0       10.4885     0.0

¹ The common fractions are retained in certain columns in order that the sum
of the deviations may be exactly zero.

The sum of the deviations of the plotted points from the
line is zero. The sum of the deviations when each is multi-
plied by the corresponding value of x is also zero. The
accuracy of the actual calculations involved in fitting may
be tested in this way. The sum of the squares of the devia-
tions, 10.4885, is a minimum. Any change in the value of
a or b would give a line from which the sum of the squared
deviations would exceed 10.4885.
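
The two checks and the minimum property may be verified directly. The Python sketch below is illustrative; it assumes the exact constants a = 19/9 and b = 17/15 (which round to 2.111 and 1.133), and the function name squared_deviations is introduced here.

```python
from fractions import Fraction

points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]
a, b = Fraction(19, 9), Fraction(17, 15)   # assumed exact constants


def squared_deviations(a, b):
    """Sum of squared deviations of the points from y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)


deviations = [y - (a + b * x) for x, y in points]

print(sum(deviations))                                      # first check
print(sum(x * d for (x, _), d in zip(points, deviations)))  # second check
print(float(squared_deviations(a, b)))

# A small change in either constant increases the sum of squares.
shift = Fraction(1, 100)
print(squared_deviations(a + shift, b) > squared_deviations(a, b))
print(squared_deviations(a, b - shift) > squared_deviations(a, b))
```

Carried in exact fractions the sum of squares comes to about 10.489; the figure 10.4885 in Table 66 appears to reflect the rounding of the separate entries in the d² column.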

FITTING A STRAIGHT LINE; SPECIAL CASES 

The simultaneous solution of the two normal equations
will give, in any case, the most probable values of a and b.
The processes of calculation may be simplified in certain
special cases, not infrequently encountered in handling
economic data. If the x's are consecutive numbers, as they
always are when an unbroken time series is plotted, the
origin may be taken at the median value. When the number
of observations is odd this will be the middle item, of
course. The value of Σ(x) will then be zero, and the normal
equations become

Σ(y) = na
Σ(xy) = bΣ(x²).

Thus if a time series extends, by years, from 1901 to 1937,
the origin may be taken at 1919, the value of x corresponding
to 1918 being -1, to 1920, +1, and so on. The solution
for values of a and b is rendered much easier when the
data may be disposed in this way. When there is an even
number of years the same process is possible, time (the
x-variable) being measured in units of one half year.
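
The simplification may be illustrated with the nine y-values of the earlier example, the origin being moved to the middle item. The Python lines below are a sketch under that assumption; the variable names are introduced here.

```python
from fractions import Fraction

ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]        # y-values of the nine points
n = len(ys)                                  # odd number of observations
xs = [i - n // 2 for i in range(n)]          # -4, -3, ..., 3, 4

# With the origin at the middle item the sum of the x's is zero,
# so each normal equation contains a single unknown.
a = Fraction(sum(ys), n)
b = Fraction(sum(x * y for x, y in zip(xs, ys)),
             sum(x * x for x in xs))
print(sum(xs), a, b)

# Moving the origin back (the middle item was x = 5) recovers the
# constant of the equation fitted earlier.
a_original = a - 5 * b
print(a_original)
```

The slope is unchanged by the shift of origin; only the constant a, which is now the computed value at the middle item, differs.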

Again, when the values of x are consecutive positive 
numbers starting at zero, the values of 2(a:) and of S(a:®) 
may be easily determined. The sum of the first n natural 
ThOth I I') 

numbers is equal to g — ’ Thus the sum of the numbers 

5(5 + 1) 

from 1 to 5 is ^ — -> or 15. This term may replace S(a:) 

in the normal equations. Similarly, the sum of the squares 

of the first n natural numbers is equal to 

Thus the sum of the squares of the numbers from 1 to 5 
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is equal to ^ ^ ~ expression may replace 

2 (a;®) in the normal equations, and we have 

- 4c^) + 


It is sometimes easier to work from equations in this form 
than in the form first given. The data for time series may 
be handled in this way, the years being numbered consecu- 
tively, beginning with 1. 
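The two closed forms, n(n+1)/2 for the sum of the first n natural numbers and n(n+1)(2n+1)/6 for the sum of their squares, may be checked against direct summation. The short Python sketch below is illustrative only; the function names are hypothetical.

```python
def sum_natural(n):
    """Sum of the first n natural numbers: n(n+1)/2."""
    return n * (n + 1) // 2


def sum_squares(n):
    """Sum of the squares of the first n natural numbers: n(n+1)(2n+1)/6."""
    return n * (n + 1) * (2 * n + 1) // 6


# Compare the closed forms with direct summation for n = 5.
n = 5
direct_sum = sum(range(1, n + 1))
direct_squares = sum(k * k for k in range(1, n + 1))
```

For a series of sixteen years numbered 1 to 16, the same functions give the sums 136 and 1,496 without any column of squares being written out.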


FITTING A CURVE OF THE POWER SERIES

The discussion above has been confined to the case of
linear trend. Such a type of curve frequently gives an
excellent fit, but in many cases it fails accurately to fit the
data. This difficulty is sometimes overcome in practice
by breaking a series into segments and fitting a separate
line to the data for each of these periods. Where there is
an actual break in the series, the period as a whole lacking
homogeneity, this practice may be justified, but when the
period is essentially homogeneous the whole concept of
secular trend is violated by this process of subdividing
and fitting separate lines. In many cases where a straight
line will not fit, a curve of the power series may represent
the trend accurately. The general process of fitting such a
curve may be briefly described.

The generalized form of the equation of the type desired
is $y = a + bx + cx^2 + dx^3 + \ldots$. An equation of this
form does not, of course, represent a curve of the parabolic
type, but in ordinary usage that term is applied to the
potential series. If carried to the second power of x it is
called a second degree parabola; if to the third power, a third
degree parabola, etc. For ordinary purposes such a curve
should not be carried beyond the second or third power of x.
If carried to the second power there are three unknowns, and
three normal equations must be solved simultaneously in
securing the required values.

The procedure is similar to that outlined for the linear
case. Each observation equation is multiplied by the
coefficient of the first unknown in that equation, and the
resulting equations are totaled to give the first normal
equation. The process is repeated for the two other
unknowns, and the three normal equations thus obtained are
solved for a, b, and c. The results are the most probable
values of these three constants. The following are the
general forms which the three normal equations take:

$$\Sigma(y) = na + b\,\Sigma(x) + c\,\Sigma(x^2)$$
$$\Sigma(xy) = a\,\Sigma(x) + b\,\Sigma(x^2) + c\,\Sigma(x^3)$$
$$\Sigma(x^2y) = a\,\Sigma(x^2) + b\,\Sigma(x^3) + c\,\Sigma(x^4).$$

As an example of the process, the calculations involved 
in fitting a second degree parabola to the points 1, 2; 2, 6; 
3, 7; 4, 8; 5, 10; 6, 11; 7, 11; 8, 10; 9, 9 may be outlined. 
It is of the greatest practical importance in curve fitting, 
as in all extensive calculations, that the work be laid out 
and carried on in a definite and systematic fashion, with 
each step definitely related to the preceding and succeeding 
operations. Checks should be introduced wherever possible, 
as mathematical errors creep into even the most careful 
work. A tabular arrangement is generally advisable, each
operation being revealed and each set of results clearly
presented. The data in the present problem may be
arranged as in Table 67.

When the x's are consecutive integers beginning with 1,
as in the present case, the values of $\Sigma(x)$, $\Sigma(x^2)$, $\Sigma(x^3)$,
and $\Sigma(x^4)$ may be secured from prepared tables.¹

¹ Cf. Table XXVIII, Pearson, Tables for Statisticians and Biometricians,
Cambridge University Press; Tables D and E, Mills and Davenport, Manual
of Problems and Tables in Statistics, New York, Henry Holt and Co. Values
to the third power are given in Appendix Table IX of the present volume.

Table 67

Computation of Values Required in Fitting a Second Degree Parabola

     x       y      xy     x^2    x^2 y

     1       2       2       1       2
     2       6      12       4      24
     3       7      21       9      63
     4       8      32      16     128
     5      10      50      25     250
     6      11      66      36     396
     7      11      77      49     539
     8      10      80      64     640
     9       9      81      81     729

    45      74     421     285   2,771

n = 9
$\Sigma(x) = 45$
$\Sigma(x^2) = 285$
$\Sigma(x^3) = 2,025$
$\Sigma(x^4) = 15,333$
$\Sigma(y) = 74$
$\Sigma(xy) = 421$
$\Sigma(x^2y) = 2,771$

Substituting these values in the equations given above,
the following normal equations are secured:

$$74 = 9a + 45b + 285c$$
$$421 = 45a + 285b + 2,025c$$
$$2,771 = 285a + 2,025b + 15,333c.$$

When these equations are solved simultaneously the
following values are secured for the three constants:

$$a = -.929$$
$$b = +3.523$$
$$c = -.267.$$

The equation of the desired curve is

$$y = -.929 + 3.523x - .267x^2.$$

This curve and the nine given points are plotted in Fig. 56 
on page 256. 
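The whole calculation for the nine points may be reproduced by machine. The Python sketch below is an illustration under stated assumptions, not the procedure of the text: the helper `solve_linear` (an ordinary Gauss-Jordan elimination) and the helper `S` are hypothetical names introduced here. The sums are built from the nine points, the three normal equations are set up in the general form given earlier, and the system is solved.

```python
def solve_linear(A, rhs):
    """Solve A * z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    M = [[float(v) for v in row] + [float(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Clear the column in every other row.
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# The nine points of the example.
x = list(range(1, 10))
y = [2, 6, 7, 8, 10, 11, 11, 10, 9]


def S(p, q=0):
    """Sum of x**p * y**q over the observations."""
    return sum(xi ** p * yi ** q for xi, yi in zip(x, y))


# Normal equations for y = a + b*x + c*x^2.
A = [[len(x), S(1), S(2)],
     [S(1),   S(2), S(3)],
     [S(2),   S(3), S(4)]]
rhs = [S(0, 1), S(1, 1), S(2, 1)]

a, b, c = solve_linear(A, rhs)
```

The coefficients and right-hand members assembled here should agree with the sums of Table 67, and the solution, rounded to three places, with the constants quoted above.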

If the values of x are consecutive, as in the present
example, the work of computation is lightened if the
mid-value is taken as origin. In this case $\Sigma(x)$ and $\Sigma(x^3)$ are
equal to zero, and the normal equations become

$$\Sigma(y) = na + c\,\Sigma(x^2)$$
$$\Sigma(xy) = b\,\Sigma(x^2)$$
$$\Sigma(x^2y) = a\,\Sigma(x^2) + c\,\Sigma(x^4).$$

When a third degree parabola of the form
$y = a + bx + cx^2 + dx^3$ is to be fitted to data, four constants must be
determined, and four normal equations are necessary.
These are of the following form:

$$\Sigma(y) = na + b\,\Sigma(x) + c\,\Sigma(x^2) + d\,\Sigma(x^3)$$
$$\Sigma(xy) = a\,\Sigma(x) + b\,\Sigma(x^2) + c\,\Sigma(x^3) + d\,\Sigma(x^4)$$
$$\Sigma(x^2y) = a\,\Sigma(x^2) + b\,\Sigma(x^3) + c\,\Sigma(x^4) + d\,\Sigma(x^5)$$
$$\Sigma(x^3y) = a\,\Sigma(x^3) + b\,\Sigma(x^4) + c\,\Sigma(x^5) + d\,\Sigma(x^6).$$

The solution for four or more constants involves a
considerable amount of arithmetical calculation, and there is
some question as to the advisability of representing secular
trend by equations of this type. With a sufficient number
of constants a curve may be fitted which will follow every
variation in the data, but such a curve could hardly be taken
to represent the long time trend.¹ Minor departures from
a simple and uniform trend, linear or otherwise, are to be
expected with economic data, but, if a real trend exists,
extreme departures from a fairly simple form are rare. If
such departures are due to pronounced changes in conditions
no single line of trend is likely to be satisfactory, and it is
advisable to break the period into parts, with a separate
line of trend for each part. "Empirical curves," says
Steinmetz, "can be represented by a single equation only
when the physical conditions remain constant within the
range of the observations." Though this statement relates
to the fitting of curves to data from the physical sciences,
it applies equally well to economic data.

Fig. 56. Illustrating the Fitting of a Second Degree Curve to Nine Points

¹ Regarding the employment of potential series of the type indicated for
representing empirical curves, Steinmetz states that their use is justified:

1. If the successive coefficients a, b, c . . . decrease in value so rapidly that

Determination of the Secular Trend of a
Business Series

FITTING A STRAIGHT LINE

The procedure of fitting certain types of curves to simple
data has been illustrated in the preceding sections. Before
proceeding to a discussion of slightly different forms, it
will be helpful to examine concrete examples of trend 
determination. We first determine the secular trend of 
a series defining the number of concerns in business in the 
United States, during the period from 1899 to 1914.^ The 
observations are given in Table 68, together with the values 
required for the fitting of a straight line to the data, and 
the derived trend values. The values of x represent the 
time factor, while the values of y are the corresponding 
numbers of business concerns. Only the entries in columns
(2) to (5), it should be noted, are employed in the fitting
process.

(Footnote 1 continued from page 256.)

within the range of observation the higher terms become rapidly smaller
and appear as mere secondary terms.

2. If the successive coefficients follow a definite law, indicating a convergent
series which represents some other function, as an exponential,
trigonometric, etc.

3. If all the coefficients are very small, with the exception of a few of them, and
only the latter ones thus need to be considered. Cf. C. P. Steinmetz,
Engineering Mathematics, New York, McGraw-Hill, 1917, 214-215.

¹ Data compiled by Dun and Bradstreet.


Table 68

Number of Concerns in Business in the United States, 1899-1914
Computation of values required in fitting line of trend

Column (3), y: number of concerns in business, in thousands.
Column (6), y_c: trend values (linear) of number of concerns in business, in thousands.

  (1)      (2)       (3)         (4)       (5)       (6)
  Year      x         y          xy        x^2       y_c

  1899      1       1,148       1,148        1      1,152
  1900      2       1,174       2,348        4      1,184
  1901      3       1,219       3,657        9      1,217
  1902      4       1,253       5,012       16      1,250
  1903      5       1,281       6,405       25      1,283
  1904      6       1,320       7,920       36      1,316
  1905      7       1,357       9,499       49      1,349
  1906      8       1,393      11,144       64      1,382
  1907      9       1,418      12,762       81      1,415
  1908     10       1,448      14,480      100      1,448
  1909     11       1,486      16,346      121      1,481
  1910     12       1,515      18,180      144      1,513
  1911     13       1,525      19,825      169      1,546
  1912     14       1,564      21,896      196      1,579
  1913     15       1,617      24,255      225      1,612
  1914     16       1,655      26,480      256      1,645

  Totals  136      22,373     201,357    1,496

N = 16
$\Sigma(x) = 136$
$\Sigma(x^2) = 1,496$
$\Sigma(y) = 22,373$
$\Sigma(xy) = 201,357$

The equations to be solved in determining the required
constants are of the form

$$\Sigma(y) = Na + b\,\Sigma(x)$$
$$\Sigma(xy) = a\,\Sigma(x) + b\,\Sigma(x^2).$$

Inserting the given values in the formulas, we have

$$22,373 = 16a + 136b$$
$$201,357 = 136a + 1,496b$$

from which

$$a = 1,118.65$$
$$b = 32.90.$$

The equation to the line of best fit is therefore

$$y = 1,118.65 + 32.90x$$

with origin at 1898.
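The solution of the two normal equations may be verified with a short Python sketch. The function name `solve_line` is hypothetical and the method shown (elimination by determinants) is merely one convenient way of solving two simultaneous equations; the five sums are those of Table 68.

```python
def solve_line(n, sum_x, sum_xx, sum_y, sum_xy):
    """Solve the normal equations
         Sigma(y)  = n*a        + b*Sigma(x)
         Sigma(xy) = a*Sigma(x) + b*Sigma(x^2)
    for a and b."""
    det = n * sum_xx - sum_x * sum_x
    b = (n * sum_xy - sum_x * sum_y) / det
    a = (sum_y - b * sum_x) / n
    return a, b


# Sums taken from Table 68 (concerns in business, in thousands).
a, b = solve_line(n=16, sum_x=136, sum_xx=1496, sum_y=22373, sum_xy=201357)

# Trend values for the first and last years (x = 1 for 1899, x = 16 for 1914).
trend_1899 = a + b * 1
trend_1914 = a + b * 16
```

Rounded to whole thousands, the two trend values computed here should agree with the first and last entries of column (6) of Table 68.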

The trend values derived from this equation appear in
column (6) of Table 68. The original data and line of
trend are plotted in Fig. 57. The fit for the period covered
is good. The number of concerns in business in the United
States during the sixteen years before the World War is
well defined by the straight line we have secured.

Fig. 57. Number of Concerns in Business in the United States,
1899-1914, with Line of Trend

FITTING A POWER CURVE OF THE SECOND DEGREE

The record of commercial failures in the United States 
over the last forty years provides an example of a series 
following a definitely non-linear trend. Data for the period 
1897-1933 are presented in Table 69, with accompanying 
computations. 


Table 69

Commercial Failures in the United States, 1897-1933 *
Computation of values required in fitting line of trend

  (1)     (2)       (3)            (4)             (5)
  Year     x         y             xy             x^2 y
                (No. of
                failures)

  1897    -18     13,351       -240,318       4,325,724
  1898    -17     12,186       -207,162       3,521,754
  1899    -16      9,337       -149,392       2,390,272
  1900    -15     10,774       -161,610       2,424,150
  1901    -14     11,002       -154,028       2,156,392
  1902    -13     11,615       -150,995       1,962,935
  1903    -12     12,069       -144,828       1,737,936
  1904    -11     12,199       -134,189       1,476,079
  1905    -10     11,520       -115,200       1,152,000
  1906     -9     10,682        -96,138         865,242
  1907     -8     11,725        -93,800         750,400
  1908     -7     15,690       -109,830         768,810
  1909     -6     12,924        -77,544         465,264
  1910     -5     12,652        -63,260         316,300
  1911     -4     13,441        -53,764         215,056
  1912     -3     15,452        -46,356         139,068
  1913     -2     16,037        -32,074          64,148
  1914     -1     18,280        -18,280          18,280
  1915      0     22,156              0               0
  1916      1     16,993         16,993          16,993
  1917      2     13,855         27,710          55,420
  1918      3      9,982         29,946          89,838
  1919      4      6,451         25,804         103,216
  1920      5      8,881         44,405         222,025
  1921      6     19,652        117,912         707,472
  1922      7     23,676        165,732       1,160,124
  1923      8     18,718        149,744       1,197,952
  1924      9     20,615        185,535       1,669,815
  1925     10     21,214        212,140       2,121,400
  1926     11     21,773        239,503       2,634,533
  1927     12     23,146        277,752       3,333,024
  1928     13     23,842        309,946       4,029,298
  1929     14     22,909        320,726       4,490,164
  1930     15     26,355        395,325       5,929,875
  1931     16     28,285        452,560       7,240,960
  1932     17     31,822        540,974       9,196,558
  1933     18     19,626        353,268       6,358,824

  Totals         610,887     +1,817,207      75,307,301

* Dun and Bradstreet.

N = 37
$\Sigma(x) = 0$
$\Sigma(x^2) = 4,218$
$\Sigma(x^3) = 0$
$\Sigma(x^4) = 864,690$
$\Sigma(y) = 610,887$
$\Sigma(xy) = 1,817,207$
$\Sigma(x^2y) = 75,307,301$

The origin is taken at the middle year to facilitate the
calculations. The values of $\Sigma(x^2)$ and $\Sigma(x^4)$ may be secured
from prepared tables, or from the formulas cited on page 254.

The normal equations to be solved in fitting a second
degree parabola, with the origin at the middle year of the
period covered, are of the form

$$\Sigma(y) = Na + c\,\Sigma(x^2)$$
$$\Sigma(xy) = b\,\Sigma(x^2)$$
$$\Sigma(x^2y) = a\,\Sigma(x^2) + c\,\Sigma(x^4).$$

Inserting the appropriate values, we have

$$610,887 = 37a + 4,218c$$
$$1,817,207 = 4,218b$$
$$75,307,301 = 4,218a + 864,690c.$$

Solving for the constants,

$$a = 14,827.6$$
$$b = 430.82$$
$$c = 14.762.$$

The required equation is

$$y = 14,827.6 + 430.82x + 14.762x^2$$

with origin at 1915.
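With the origin at the middle year the three normal equations separate: b is found by a single division, and a and c from a pair of equations. The Python sketch below illustrates this with the sums of Table 69; the function name `fit_parabola_centered` is hypothetical and the sketch is offered as a check on the arithmetic rather than as the method of the text.

```python
def fit_parabola_centered(n, sum_xx, sum_x4, sum_y, sum_xy, sum_xxy):
    """Constants of y = a + b*x + c*x^2 when Sigma(x) = Sigma(x^3) = 0.

    Normal equations:
        Sigma(y)     = n*a          + c*Sigma(x^2)
        Sigma(xy)    = b*Sigma(x^2)
        Sigma(x^2 y) = a*Sigma(x^2) + c*Sigma(x^4)
    """
    b = sum_xy / sum_xx                      # second equation stands alone
    det = n * sum_x4 - sum_xx * sum_xx       # first and third solved together
    c = (n * sum_xxy - sum_xx * sum_y) / det
    a = (sum_y - c * sum_xx) / n
    return a, b, c


# Sums from Table 69 (commercial failures, 1897-1933, origin at 1915).
a, b, c = fit_parabola_centered(
    n=37, sum_xx=4218, sum_x4=864690,
    sum_y=610887, sum_xy=1817207, sum_xxy=75307301,
)
```

The separation of the second equation is the practical gain from centering: only two unknowns remain to be solved simultaneously.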

The original observations and the line of secular trend
are plotted in Figure 58. Observations, trend values and
deviations from trend are given in Table 70.

Fig. 58. Commercial Failures in the United States, 1897-1933, with
Line of Trend

Commercial failures reflect the major cycles in American
business, but with movements that reverse those of most
economic series. Failures are numerous in times of
depression, fewer in prosperity. The reader who will compare the
deviations from trend shown in Table 70 with the dates
of reference cycles given on an earlier page will note the
general agreement. The sharp fall in business failures
from 1932 to 1933 reflected, of course, the special conditions
prevailing in the latter year.

Table 70

Commercial Failures in the United States, 1897-1933
Actual Values, Trend Values, and Deviations from Trend

           Number of      Trend value,      Deviation of
  Year     commercial     second degree     actual from
           failures       parabola          trend value

  1897      13,351         11,855.73         + 1,495.27
  1898      12,186         11,769.88         +   416.12
  1899       9,337         11,713.55         - 2,376.55
  1900      10,774         11,686.75         -   912.75
  1901      11,002         11,689.47         -   687.47
  1902      11,615         11,721.72         -   106.72
  1903      12,069         11,783.49         +   285.51
  1904      12,199         11,874.78         +   324.22
  1905      11,520         11,995.60         -   475.60
  1906      10,682         12,145.94         - 1,463.94
  1907      11,725         12,325.81         -   600.81
  1908      15,690         12,535.20         + 3,154.80
  1909      12,924         12,774.11         +   149.89
  1910      12,652         13,042.55         -   390.55
  1911      13,441         13,340.51         +   100.49
  1912      15,452         13,668.00         + 1,784.00
  1913      16,037         14,025.01         + 2,011.99
  1914      18,280         14,411.54         + 3,868.46
  1915      22,156         14,827.60         + 7,328.40
  1916      16,993         15,273.18         + 1,719.82
  1917      13,855         15,748.29         - 1,893.29
  1918       9,982         16,252.92         - 6,270.92
  1919       6,451         16,787.07         -10,336.07
  1920       8,881         17,350.75         - 8,469.75
  1921      19,652         17,943.95         + 1,708.05
  1922      23,676         18,566.68         + 5,109.32
  1923      18,718         19,218.93         -   500.93
  1924      20,615         19,900.70         +   714.30
  1925      21,214         20,612.00         +   602.00
  1926      21,773         21,352.82         +   420.18
  1927      23,146         22,123.17         + 1,022.83
  1928      23,842         22,923.04         +   918.96
  1929      22,909         23,752.43         -   843.43
  1930      26,355         24,611.35         + 1,743.65
  1931      28,285         25,499.79         + 2,785.21
  1932      31,822         26,417.76         + 5,404.24
  1933      19,626         27,365.25         - 7,739.25

The second degree curve employed to define the trend
of commercial failures does so with reasonable accuracy
over the period here covered. Extrapolation beyond those
limits would be hazardous. Indeed, the changed conditions
under which banking and many other types of business
were conducted after 1933 may well break the continuity
of the series, and generate a new long-term trend.

The Use of Logarithms in Curve Fitting 

The family of curves described above represents a simple 
and very useful type. Perhaps of even greater general 
utility, in the analysis of time series, are curves of a semi- 
logarithmic type. The advantages of plotting many series 
of data on semi-logarithmic or “ratio” paper were explained 
in an earlier section. A fundamental virtue of this type 
of plotting is that it presents a true picture of relative
variations, of ratios between magnitudes. Relations of
this type are ordinarily of primary interest in the analysis
of economic data, and it is logical that determination of
trends should proceed on the same basis.

In doing so, we can make use of a group of curves of the
same general form as those already described, the one
difference being that log y takes the place of y throughout.
That is, the straight line form is $\log y = a + bx$, while the
general form for the potential series is
$\log y = a + bx + cx^2 + dx^3 + \ldots$. The curves secured may be constructed on
arithmetic paper, plotting the natural x's and the logarithms
of the y's, or natural values of both x's and y's may be
plotted on semi-logarithmic paper, the logarithmic scale
extending along the y-axis. The latter is the simpler
method.

To illustrate the procedure, the steps involved in fitting
a curve of the type $\log y = a + bx$ will be shown. The
trend of petroleum production in the United States from
1922 to 1929 is to be determined. The values needed in
the normal equations are derived from Table 71.

Table 71

Petroleum Production in the United States, 1922-1929
(Computation of values required in fitting line of trend)

  Year     x        y         log y      x · log y

  1922     1       557.5     2.74624      2.74624
  1923     2       732.4     2.86475      5.72950
  1924     3       713.9     2.85364      8.56092
  1925     4       763.7     2.88292     11.53168
  1926     5       770.9     2.88700     14.43500
  1927     6       901.1     2.95477     17.72862
  1928     7       901.5     2.95497     20.68479
  1929     8     1,007.3     3.00316     24.02528

                            23.14745    105.44203

N = 8
$\Sigma(x) = 36$
$\Sigma(x^2) = 204$
$\Sigma(\log y) = 23.14745$
$\Sigma(x \cdot \log y) = 105.44203$


The two normal equations to be solved are of the form

$$\Sigma(\log y) = Na + b\,\Sigma(x)$$
$$\Sigma(x \cdot \log y) = a\,\Sigma(x) + b\,\Sigma(x^2).$$

Substituting the given values we have

$$23.14745 = 8a + 36b$$
$$105.44203 = 36a + 204b.$$

Solving for the constants,

$$a = 2.75645$$
$$b = .03044.$$

The equation to the desired curve is, therefore,

$$\log y = 2.75645 + .03044x$$

with origin at 1921.

In fitting this curve by the method of least squares, as
is done above, we satisfy the condition that the sum of
the squares of the logarithmic deviations shall be a minimum.
That is, the deviations to which this condition relates are
the differences between the logarithms of the observed
values and the logarithms of the corresponding trend values.
This curve, it should be noted, is not the same as that
from which the sum of the squares of the arithmetic (natural)
deviations is a minimum.
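The fitting of a straight line to the logarithms may be sketched in Python as follows. This is an illustration under stated assumptions: the production figures are those of Table 71 (millions of barrels), common logarithms are taken by machine rather than from five-place tables, and the variable names are hypothetical. Small differences in the last decimal place from the tabled values are therefore to be expected.

```python
import math

# Petroleum production, 1922-1929, in millions of barrels (Table 71).
years = list(range(1922, 1930))
production = [557.5, 732.4, 713.9, 763.7, 770.9, 901.1, 901.5, 1007.3]

# x is measured from an origin at 1921; log y replaces y throughout.
x = [year - 1921 for year in years]
log_y = [math.log10(value) for value in production]

n = len(x)
sum_x = sum(x)
sum_xx = sum(xi * xi for xi in x)
sum_log_y = sum(log_y)
sum_x_log_y = sum(xi * li for xi, li in zip(x, log_y))

# Normal equations: Sigma(log y)   = N*a        + b*Sigma(x)
#                   Sigma(x log y) = a*Sigma(x) + b*Sigma(x^2)
det = n * sum_xx - sum_x * sum_x
b = (n * sum_x_log_y - sum_x * sum_log_y) / det
a = (sum_log_y - b * sum_x) / n

# The quantity minimized is the sum of squared LOGARITHMIC deviations.
sum_sq_log_dev = sum((li - (a + b * xi)) ** 2 for xi, li in zip(x, log_y))
```

The last line computes the quantity that the fit makes a minimum. A line fitted to the natural numbers would minimize a different sum and would in general yield a different curve.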

The substitution in the above equation of the value of 
X representing any given year will enable the logarithm 
of the trend or normal value to be calculated. The trend 
value in natural numbers may then be determined. In 
Table 72 the normal value for each of the years covered 
is given, together with the percentage relation of actual to 
normal. 


Table 72

Trend of Petroleum Production in the United States, 1922-1929,
with Comparison of Actual and Trend Values
(Straight line trend determined from logarithms of production figures)

                  y (actual)       log y_c        y_c (computed)    Percentage
  Year     x      Production       Log of         Trend value       relation of
                  (in millions     trend          (in millions      actual to
                  of bbls.)                       of bbls.)         trend

  1922     1         557.5         2.78689           612.2             91.1
  1923     2         732.4         2.81733           656.6            111.5
  1924     3         713.9         2.84777           704.3            101.4
  1925     4         763.7         2.87821           755.5            101.1
  1926     5         770.9         2.90865           810.3             95.1
  1927     6         901.1         2.93909           869.1            103.7
  1928     7         901.5         2.96953           932.2             96.7
  1929     8       1,007.3         2.99997           999.9            100.7


The points representing the actual production, together 
with the line of trend, are plotted in Fig. 59. The graph 
of the derived equation gives a good representation of 
the trend in the present instance. 

An equation of this type, defining a linear trend in the
logarithms of the dependent variable, has certain
distinctive advantages. The reader will note that this is the
logarithmic form of an equation to a compound interest
curve (an exponential curve). This equation was given
in Chapter II as

$$y = p(1 + r)^x$$

or

$$\log y = \log p + x \log(1 + r).$$

In the example just given we have used the symbol a for
log p and the symbol b for log (1 + r), but the equations
are identical.

Fig. 59. Production of Petroleum in the United States, 1922-1929,
with Line Defining Average Rate of Growth (vertical scale: millions of barrels)

We may readily change to natural numbers the constants
in the equation defining the trend of petroleum production
from 1922 to 1929. We have

$$\log y = 2.75645 + .03044x$$

where 2.75645 is log p and .03044 is log (1 + r). The
natural number corresponding to 2.75645 is 570.8; the
natural number corresponding to .03044 is 1.0726. The
trend of petroleum production in natural form is, therefore,

$$y = 570.8(1.0726)^x$$

with origin at 1921.

Subtracting 1 from the constant 1.0726 we secure .0726,
which is r, the rate of increase of a series growing in
accordance with the compound interest law. (If, on subtracting
1, we have a negative value, the growth is negative, of
course.) This measure indicates that the production of
crude petroleum increased at an average rate of 7.26 per
cent a year between 1922 and 1929 (r being multiplied by
100 to place it on a percentage basis).
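The passage from logarithmic constants to natural numbers is a matter of taking antilogarithms. A brief Python sketch follows, offered as an illustration only; the variable and function names are hypothetical, and the two constants are those of the fitted equation log y = 2.75645 + .03044x.

```python
# Constants of the fitted equation log y = a + b*x (origin at 1921).
a = 2.75645   # log p
b = 0.03044   # log (1 + r)

p = 10 ** a              # trend value at the origin, in natural numbers
ratio = 10 ** b          # (1 + r), the constant ratio between successive years
r = ratio - 1            # average annual rate of growth
rate_per_cent = 100 * r


def trend(x):
    """Trend value in natural numbers, y = p * (1 + r)**x."""
    return p * ratio ** x
```

Because the curve is a compound interest curve, each trend value is the preceding one multiplied by the same ratio, which is why a single figure r can serve as a summary of the whole trend.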

When the trend of a series in time may be defined by a
straight line on ratio paper, and it is surprising how widely
applicable such a function is, the constant r is a highly
useful measure. It defines the average annual rate of
growth or decline of the series. It is, of course, an abstract
measure and thus has the great merit of permitting
comparison of the trends of series relating to widely different
original units. The rate of growth of population, over a
given period, may have been 1.4 per cent per year; the
production of gasoline may have increased at a rate of
4.5 per cent, the production of automobiles at 4.2 per
cent, the production of wheat at 1.1 per cent, total national
income at 1.6 per cent, total national debt at 3.2 per cent.
The trends of these series are immediately comparable, and
conclusions concerning the direction and character of a
nation's development may be drawn. This measure provides a
valuable device for the study of social and economic change.¹

¹ In any extensive application of this procedure time and labor may be
saved by utilizing Glover's mean value table (cf. James W. Glover, Tables of
Applied Mathematics, George Wahr, Ann Arbor, Michigan, 1923, 468ff.).
By the use of this table the compound interest curve may be fitted directly
to the natural numbers. All necessary computations are simply and quickly
performed.

Fig. 60. Production of Crude Petroleum in the United States,
1918-1936, with Line of Trend

Table 73

Production of Crude Petroleum in the United States, 1918-1936
Comparison of Actual and Trend Values
(Trend values determined from second degree parabola fitted to logarithms of
production figures)

            y (actual)        y (computed)      Percentage
  Year      Production        Trend value       relation of
            (in millions      (in millions      actual to
            of bbls.)         of bbls.)         trend

  1918         335.9             345.0             97.4
  1919         378.4             395.5             95.7
  1920         442.9             449.2             98.6
  1921         472.2             505.2             93.5
  1922         557.5             562.8             99.1
  1923         732.4             620.8            118.0
  1924         713.9             678.2            105.3
  1925         763.7             733.8            104.1
  1926         770.9             786.3             98.0
  1927         901.1             834.3            108.0
  1928         901.5             876.8            102.8
  1929       1,007.3             912.4            110.4
  1930         898.0             940.5             95.5
  1931         850.3             960.0             88.6
  1932         785.2             970.4             80.9
  1933         905.7             971.5             93.2
  1934         908.1             963.2             94.3
  1935         996.6             945.7            105.4
  1936       1,098.5             919.6            119.5

By the use of additional terms a function of the type just
discussed may be modified, when dealing with a series
marked by non-linear trends on ratio paper. For example, if
the course of petroleum production be followed over a longer
period, as is done in Fig. 60, it is obvious that the trend
line secured for the period 1922-1929 is inappropriate. The
addition of a third constant gives an equation of the type

$$\log y = a + bx + cx^2.$$

In fitting this to the data of petroleum production for
the period 1918-1936, we may follow the exact procedure
used when y was the dependent variable in a similar
equation (see page 261), except that log y is used throughout,
instead of y. For the required equation we have

$$\log y = 2.921331 + .023660x - .002107x^2$$

with origin at 1927. This is shown graphically in Fig. 60.
Actual and trend values, in natural terms, are given in
Table 73 on page 269.
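Trend values in natural terms follow from the fitted equation by substituting x and taking the antilogarithm. The Python sketch below is illustrative; the function name `petroleum_trend` is hypothetical, and the constants are those of the equation log y = 2.921331 + .023660x - .002107x^2 with origin at 1927.

```python
def petroleum_trend(year, a=2.921331, b=0.023660, c=-0.002107, origin=1927):
    """Trend value (millions of barrels) from log y = a + b*x + c*x^2,
    where x is the number of years from the origin."""
    x = year - origin
    return 10 ** (a + b * x + c * x * x)


def per_cent_of_trend(actual, year):
    """Percentage relation of an actual value to the trend value for that year."""
    return 100 * actual / petroleum_trend(year)


trend_1927 = petroleum_trend(1927)
trend_1936 = petroleum_trend(1936)
relation_1936 = per_cent_of_trend(1098.5, 1936)
```

Values computed by machine in this way may differ slightly in the last place from those of Table 73, which rest upon tabled logarithms.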

Other Curve Types 

The two families of curves described in the preceding 
sections meet most of the needs of the economic statistician. 
The trend in most time series may be described by curves 
of the power series, fitted either to natural numbers or
to the logarithms of the data (that is, to the logarithms
of the y values; time, the x-variable, is treated in terms of
natural numbers in fitting both the above types of curves).
These classes constitute flexible and widely applicable curve
forms.¹ Attention may be called to several other curve
types which have been applied less extensively to time
series, but with favorable results in particular cases.

¹ There are available for fitting higher degree curves of the power series
methods that lessen the labor involved, particularly if curves of different degree
are to be fitted to the same data. These methods, which reduce the fitting
process to a series of simple adding machine operations, are appropriate to
extended research projects. Their use is not advisable, however, unless work
involving a considerable number of routine operations is contemplated. It is
desirable that the student master the basic least squares procedures outlined
in the preceding pages, utilizing other methods only in case extended computing
tasks are undertaken.

For accounts of systematic methods suited to extensive calculations, see
R. A. Fisher, Statistical Methods for Research Workers, Edinburgh, Oliver
and Boyd, Sixth edition, 1936, 148-156; Max Sasuly, Trend Analysis of
Statistics: Theory and Technique, Washington, Brookings Institute, 1934. The
application of the method of orthogonal polynomials described by Fisher is
admirably exemplified in James W. Angell, The Behavior of Money, New York,
McGraw-Hill, 1936, 195-202.

Curves of the ordinary parabolic type ($y = ax^b$) are not
generally applicable to economic data in the form of time
series, as their use involves the treatment of the time
variable as a geometric series. Such a curve, it will be
recalled, becomes a straight line on double logarithmic
paper. Yet if a curve of this form serves accurately to
describe the trend of a given series, its use is justified,
empirically.

Such curves may be fitted most readily by employing
logarithms and using an equation of the linear type. The
equation

$$y = ax^b$$

becomes, in logarithmic form,

$$\log y = \log a + b \log x.$$

The two normal equations needed in fitting such a curve
are of the form

$$\Sigma(\log y) = n \log a + b\,\Sigma(\log x)$$
$$\Sigma(\log x \cdot \log y) = \log a\,\Sigma(\log x) + b\,\Sigma[(\log x)^2].$$

By substituting the values computed from the data, these
equations may be solved for log a and b, just as in fitting
an ordinary straight line.¹
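The fitting of a curve of the type y = ax^b through logarithms may be sketched in Python as follows. The sketch is an illustration, not the author's worked example: the function name `fit_power_curve` is hypothetical, and the data are artificial values constructed to lie exactly on y = 3x^2, so that the constants recovered can be checked at sight.

```python
import math


def fit_power_curve(x, y):
    """Fit y = a * x**b by least squares on the logarithms.

    The straight line log y = log a + b*log x is fitted, and the
    antilogarithm of the first constant is returned as a.
    All x and y values must be positive.
    """
    log_x = [math.log10(v) for v in x]
    log_y = [math.log10(v) for v in y]
    n = len(x)
    s_lx = sum(log_x)
    s_ly = sum(log_y)
    s_lxlx = sum(u * u for u in log_x)
    s_lxly = sum(u * v for u, v in zip(log_x, log_y))
    det = n * s_lxlx - s_lx * s_lx
    b = (n * s_lxly - s_lx * s_ly) / det
    log_a = (s_ly - b * s_lx) / n
    return 10 ** log_a, b


# Artificial data lying exactly on y = 3 * x**2 (for illustration only).
x = [1, 2, 3, 4, 5]
y = [3.0 * xi ** 2 for xi in x]
a, b = fit_power_curve(x, y)
```

It should be remembered that here, as with the semi-logarithmic curve, it is the sum of the squared logarithmic deviations that is made a minimum.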

¹ A useful table of the sums of the logarithms of the natural numbers from
1 to 100 is included as an appendix to Medical Biometry and Statistics, by
Raymond Pearl, Philadelphia, Saunders, 1923.

The equation to the simple exponential curve may be
written

$$y = ar^x.$$

(The r in this equation is the equivalent of 1 + r, as given
on p. 267.) This equation may be used to define the trend
of a series increasing or decreasing in geometric progression.
It has been observed that the trends of economic series
frequently depart from such a geometric progression by
constant magnitudes. By adding this magnitude, in a
given case, to the original series (or subtracting it), a
modified series with a clear exponential trend may be
secured. The trend of the original series may be written

$$y = K + ar^x$$

where K is the constant magnitude by which the series
departs from a geometric progression. A modified exponential
curve of this type may give a highly satisfactory
representation of trend, in certain cases. The method employed in
fitting such a curve is discussed in Appendix D.

Some use has been made, in the interpretation of
economic statistics, of the Gompertz curve, the equation to
which was originally developed in the actuarial field. The
equation is

$$y = ab^{c^x}.$$

Its use in the analysis of economic statistics has been based
upon the argument that there is a general law of growth
characteristic of population increase, and that this same
type of growth is found in industries whose products are a
direct function of the growth of population.

A somewhat similar curve of growth, the "logistic," has
been employed by Verhulst and more recently by Raymond
Pearl and Lowell J. Reed in forecasting population growth.
This curve has been found to describe the trends of
certain economic series. Examples of the procedures employed
in fitting Gompertz and logistic curves are given in
Appendix D.

The Determination op Monthly Trend Values 

The procedures so far described have dealt with ftTmuftl 
measurements only. Having fitted a line or curve to ftTumnl 
data it is frequently necessary to make a transition to 
monthly units. Problems involving such monthly measvue- 
ments are faced in the study of cyclical movements which 
are discussed in the next chapter. 

The constant a in the trend equation defines the trend 
value in the year taken as origin. If the annual data 
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employed in the fitting processes are averages of twelve 
monthly values (e.g., the average price of pig iron in 1937) 
the constant a measures the trend value for a month cen- 
tering at the middle of the year covered by the annual 
figures. If the annual data are aggregates of twelve monthly 
values (e.g., total production of pig iron in 1937) the constant 
a must be divided by 12 to obtain the trend value for the 
month centering at the middle of the year. 

If the trend be linear, the constant b in the equation 
y = a + bx defines the change due to trend over a twelve-month 
period. In interpolating for monthly trend values, 
the increment (or decrement) from month to month (e.g., 
from January to February of the year 1937) is b/12 if the 
annual data employed in the fitting process are averages 
of monthly values. The increment from month to month is 
b/144 if the annual data are aggregates of monthly values. 

The one further step needed is properly to center the 
monthly trend values. These should, of course, be centered 
at points of time corresponding to those to which the 
actual monthly data relate. In averaging, or aggregating, 
monthly data relating to the middle of each of the twelve 
months in a calendar year we secure a figure centering at 
July 1. The month centering at the middle of the year 
of origin thus centers at July 1. For comparison with actual 
monthly data, we desire trend values centering at July 15, 
August 15, etc. At the beginning, therefore, we must add 
to the trend value for the month centering at the middle 
of the year of origin (that is, to a or to a/12) one-half of the 
month-to-month increment (or decrement) that we have obtained 
from b of the trend equation. This procedure gives us 
the trend value for the month centering at July 15. This value 
may be compared with the actual value recorded for that 
month. The addition to this of the month-to-month trend 
increment (or decrement) gives trend values for all following 
months; subtraction gives trend values for all preceding 
months.¹ 
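
The interpolation just described may be sketched in a few lines of Python. This is an illustration only: the function name monthly_trend and the constants a = 1200 and b = 144 are hypothetical, chosen so that the arithmetic is easy to follow. The sketch assumes a linear trend with origin at the middle of a calendar year, and monthly data relating to the middle of each month.

```python
def monthly_trend(a, b, months_after_origin, annual_is_total=False):
    """Trend value for the month centred `months_after_origin` months
    after the origin of the annual equation y = a + b*x (x in years).

    The origin is taken at the middle of the origin year (July 1), so the
    month centred at July 15 of that year lies 0.5 months after the origin.
    """
    if annual_is_total:
        # Annual figures are aggregates of twelve monthly values.
        level, step = a / 12, b / 144
    else:
        # Annual figures are averages of twelve monthly values.
        level, step = a, b / 12
    return level + step * months_after_origin


# Hypothetical constants, used only for illustration.
a, b = 1200.0, 144.0

july = monthly_trend(a, b, 0.5, annual_is_total=True)     # centred July 15
august = monthly_trend(a, b, 1.5, annual_is_total=True)   # centred August 15
june = monthly_trend(a, b, -0.5, annual_is_total=True)    # centred June 15
print(july, august, june)
```

With these assumed constants the monthly level is 100 and the month-to-month increment is 1, so the values for June, July, and August of the year of origin are 99.5, 100.5, and 101.5.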

The Selection of a Curve to Represent Trend 

Various types of curves which may be fitted to represent 
the trend of economic data over a period of time have been 
described. But which of these many types is to be selected 
in a given case? Which will give the best standard of 
normality for each of the years covered? Several references 
to this problem have been made in the preceding sections, 
but no general principles have been laid down. And, in 
fact, no general principles can be evoked to answer this 
fundamental question. There is no absolute test of goodness 
of fit in such cases. It is largely a matter of personal judg- 
ment as to the type of curve which best represents the 
trend in a given instance, and experience must play a 
dominant part in such judgments. But certain general 
considerations are of assistance in selecting the appropriate 
type of curve. 

1. The first step in the selection of a curve type is the 
plotting of the data. When this has been done, it is fre- 
quently possible by inspection to determine the appropriate 
form. The data may be plotted in four different combinations, 
of which the first two are of chief importance in 
dealing with economic material. 

a. Natural x, natural y. (That is, plot the given figures on ordinary 
arithmetic paper.) 

b. Natural x, log y. (Plot the x's on the natural scale, and plot 
the y's on the logarithmic scale; i.e., use semi-logarithmic 
paper.) 

c. Natural y, log x. (Plot on semi-logarithmic paper, with the 
x-scale logarithmic.) 

d. Log y, log x. (Plot on paper with logarithmic ruling on both 
scales.) 

¹ If the original monthly data relate to the first or last of the month, rather 
than the middle, a similar correction is needed, but the monthly dates named 
in the text would be different, of course. If the trend equation is non-linear, 
the process of interpolation must be correspondingly modified. For a discussion 
of appropriate procedures the reader is referred to any treatise dealing 
with the general principles of interpolation. The Calculus of Observations, 
by Whittaker and Robinson, contains an excellent treatment of this topic. 

If in any of these cases a straight line trend is secured, 
a type of equation which plots as a straight line under the 
given conditions (cf. Chapter II) would be selected. If a 
linear equation is not appropriate some other simple type 
may be suggested by the plotted data. In studying such 
graphs for the purpose of selecting a curve to represent 
trend, one should be familiar with the curves representing 
all the simpler equations. 

2. The appropriate curve may be determined by a study 
of the relations between the two variables, x and y. In 
the simpler cases the following relations hold:¹ 

a. If, when the values of x are arranged in an arithmetic series, 
the corresponding values of y form a geometric series, the 
relation is of the exponential type, described by the equation 

y = ab^x. 

b. If, when the values of x are arranged in a geometric series, the 
corresponding values of y form a geometric series, the relation 
is of the simple parabolic or hyperbolic type, described 
by the equation 

y = ax^b. 

c. If, when the values of x are arranged in an arithmetic series, 
the first differences of the corresponding y's are constant, the 
relation is of the straight line type, described by the equation 

y = a + bx. 

¹ It will be recalled that an arithmetic series changes by a constant absolute 
increment, while a geometric series changes by a constant percentage. 

The differences between successive y values, when x's are arranged in an 
arithmetic series, are termed "first differences" or "first order differences" 
and are represented by the symbol Δy. The differences between successive 
first differences are called "second differences" and are represented by the 
symbol Δ²y. Differences of higher order are similarly derived. The following 
table illustrates the formation of differences: 


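 x        y      Δy     Δ²y    Δ³y 
 1       11 
 2       40      29 
 3      101      61      32 
 4      206     105      44     12 
 5      367     161      56     12 
 6      596     229      68     12 
 7      905     309      80     12 
 8    1,306     401      92     12 
 9    1,811     505     104     12 
10    2,432     621     116     12 

(Each difference is entered on the line of the later of the two values from which it is formed.) 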


d. If, when the values of x are arranged in an arithmetic series, 
the nth differences of the corresponding y's are constant, the 
relation between the variables is described by an equation of 
the potential series carried to the nth power of x; that is, by an 
equation of the type 

y = a + bx + cx^2 + dx^3 + . . . , carried through the term in x^n. 

Thus, in the example given above, in which the third differences 
are constant, the relation between x and y would be 
described by an equation of the form 

y = a + bx + cx^2 + dx^3. 


When one is selecting a curve to use in the analysis of 
economic data, he will rarely, if ever, find these tests to 
be met perfectly. This would happen only when the curve 
chosen passed through all the plotted points. But data 
in a given case will generally approximate some one of 
the conditions described above, and the appropriate type 
of curve will be indicated. 
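
The difference test lends itself to a short computation. The following Python sketch forms the successive differences of the y values given in the illustrative table above and reports the order at which they become constant. The function names and the tolerance argument are illustrative assumptions; with actual economic data the differences will be only approximately constant, and a tolerance greater than zero would have to be chosen by judgment.

```python
def differences(values):
    """Differences between successive items of a series."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def constant_difference_order(values, tolerance=0.0):
    """Lowest order at which the differences are constant (within tolerance)."""
    order, current = 0, list(values)
    while len(current) > 1:
        current = differences(current)
        order += 1
        if max(current) - min(current) <= tolerance:
            return order
    return None


# y values from the table above, for x = 1, 2, ..., 10.
y = [11, 40, 101, 206, 367, 596, 905, 1306, 1811, 2432]

first = differences(y)
second = differences(first)
third = differences(second)
order = constant_difference_order(y)
print(first)
print(second)
print(third)
print(order)
```

The third differences are all equal to 12, so the order reported is 3, which points to an equation carried to the third power of x.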

3. If the study of the original data does not render a 
definite decision possible, several types of curves may be 
fitted to the data and the decision made by comparing 
the results. If the equations to the curves being compared 
contain the same number of constants, a comparison of 
the root-mean-square deviations about the curves furnishes 
a conclusive and valid test of the closeness of the fit within 
the limits of the data. 
The root-mean-square deviation may be readily computed 
by making use of the following relationship 

Σ(d^2) = Σ(y^2) - aΣ(y) - bΣ(xy) - cΣ(x^2 y) - . . . 

where Σ(d^2) is the sum of the squares of the deviations 
about the line of trend. (The derivation of this equation is 
explained in Appendix A, in which a generalized form is 
given.) If the equations do not contain an equal number 
of constants, a test of this sort is invalid and the comparison 
can only be made by inspection. Personal judgment as 
to the curve which represents the trend most accurately 
must be the basis of the decision in such cases. 
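
The relationship may be checked numerically for the simplest case, a straight line fitted by least squares. The sketch below uses the ten annual averages of freight car loadings that appear in Table 75 of the next chapter, with x numbered 1 to 10 (origin at 1917). The variable names are illustrative, and the root-mean-square deviation is here taken as the square root of Σ(d^2)/N, which is an assumption about the divisor.

```python
# Annual averages of weekly freight car loadings, 1918-1927 (Table 75).
x = list(range(1, 11))
y = [857, 805, 868, 757, 829, 956, 932, 983, 1018, 990]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(v * v for v in x)
sum_xy = sum(u * v for u, v in zip(x, y))
sum_yy = sum(v * v for v in y)

# Least-squares constants of the line y = a + bx.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
a = (sum_y - b * sum_x) / n

# Sum of squared deviations, computed directly and by the short method.
direct = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))
shortcut = sum_yy - a * sum_y - b * sum_xy

root_mean_square = (direct / n) ** 0.5
print(round(a, 2), round(b, 3), round(direct, 3), round(shortcut, 3))
```

The constants obtained agree with the equation y = 769.00 + 23.727x used in the next chapter, and the two sums of squares agree apart from rounding in the arithmetic.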

It should be remembered that the closeness of fit within 
the limits of the data is not of itself a final criterion. An 
equation could be secured, having a number of constants 
equal to the number of points, which would give a curve 
passing through every point plotted, yet such a curve 
would not necessarily represent the trend. The concept 
of a trend is of a regular, smooth underlying movement, 
from which there are deviations, but which marks the long- 
time tendency of the series. In general, therefore, the curve 
should be of simple form, if it is to be consistent with the 
concept of secular trend. This does not mean, however, 
that a complex trend can be represented by a simple curve 
which fails to conform to the plotted data. 

4. An important question to be answered before the 
form of curve can be selected relates to the limits within 
which the line of trend is to be used. If it is to be used only 
within the limits of the plotted data (i.e., for interpolation) 
one set of considerations governs the choice of a curve. If 
it is to be projected beyond the limits of the data, used as 
a basis for the determination of normal during a subsequent 
period, other considerations enter. In the former case a 
reasonable fit to the data is the sole requirement; in the 
latter case it is necessary, in addition, that the trend of the 
projection be logical, and consistent with the past record. 
The fact should be clearly recognized that projection, or 
extrapolation, represents a guess, justified only on the 
assumption that a proper line of trend has been fitted and 
that the same conditions that affected the series in the 
past will prevail in the future. A change in conditions, the 
introduction of new elements, renders the projection invalid. 
When dealing with economic statistics, moreover, it is 
ordinarily impossible to tell, except in retrospect, when a 
change has taken place. Conclusions drawn from the projection 
of a line of trend are always subject to error, therefore. 
In practical statistical work such projections are made, 
and are justified on the ground that the most probable 
course in the future is that which prevailed in the past. 
Projections into the distant future are, of course, subject 
to wider margins of error than short-time projections. 
Lines of trend should be revised from time to time, therefore, 
as new data become available. 

When a projection is to be made, a simple curve with few 
constants is to be preferred to a more complicated one. A 
third or fourth degree parabola may give an excellent fit 
to the data in a given case, but the projection of such curves 
is inadvisable. It is well to remember, as Perrin has pointed 
out, that a curve suitable for interpolation may not be at 
all adapted to extrapolation. 

The avoidance of distortion of trend lines by abnormal 
conditions in the terminal years of the period studied is 
particularly important when a trend is to be projected. 
Reference is made to this point in the next chapter. 

It seems to be true, in general, that simple curves fitted 
to the logarithms of the y's give more reliable results when 
projected than curves fitted to the natural numbers. In 
an interesting discussion of this point, Karl G. Karsten¹ 
argues that phenomena characterized by a uniform rate 
of change are more likely to maintain their trend than 
phenomena marked by a uniform amount of change. It is 
the semi-logarithmic curves, of course, which best measure 
rates of change. 

¹ Karl G. Karsten, Charts and Graphs, New York, Prentice-Hall, 1923, 428-425. 

5. It is frequently true that no one curve will fit a given 
series during the entire period it is desired to study. This 
may be due to changes in conditions which cause the trend 
to be altered. Thus the trend of wholesale prices was 
downward, in a direction well represented by a straight 
line, from the close of the Civil War to 1896. From 1896 
to the beginning of the World War the trend was upward, 
and could be described by a second degree parabola. From 
1921 to 1929 the trend was also curvilinear, rising to 1925, 
declining thereafter. Similar changes occur in many eco- 
nomic series. By breaking the entire period up into sections, 
appropriate lines of trend may be fitted to the several 
periods thus marked off. This process may be carried to 
a quite illogical extreme, however. The concept of trend 
is of a gradual, long-term change, and the breaking up 
of a series in order to fit a number of trend lines is contrary 
to the whole conception. It may be justified upon occasion 
when a real change in conditions occurs, but in all cases 
the attempt should be made to represent the trend during 
the whole period by a single line. 

Deflation as a Step in Analysis 

Many series of economic data are expressed in monetary 
units, in dollars, pounds, or francs. Such series are subject 
to distortion because of changes in the price level. Thus 
the value of heavy engineering contracts awarded in the 
United States in 1913 amounted to approximately 601 
millions of dollars; in 1929 the value of engineering contracts 
awarded in the same territory amounted to approximately 
3,950 millions of dollars.¹ Was the volume of engineering 
construction in 1929 over six times that in 1913? It was 
not. The value of construction contracts awarded in any 
year depends not only upon the actual volume of construction 
but also upon the costs of construction materials and 
labor, and these costs increased substantially from 1913 
to 1929. If we wish to measure the change in the volume 
of construction alone, these values must be corrected for 
the increase in construction costs between 1913 and 1929. 
Such a process is termed deflation.² 

¹ Figures compiled by Engineering News Record. 

² The term deflation is not inappropriate when correction is being made for 
an advance in prices; it is less suitable when correction is made for a fall in 
prices. The period selected as a standard of reference may be one in which 
prices were relatively high; division by a price or cost index resting on such a 
year as base will raise values relating to other periods. The word deflation 
is a convenient one to use for this general process, however. In using it in 
this somewhat technical sense we must understand it to mean correction for 
changes in the value of the dollar (as measured by specific indices of prices or 
costs). 

Fig. 61. Comparison of Actual and Deflated Values of Contracts 
Awarded in Engineering Construction, 1913-1936 

The selection of an appropriate deflating index is the 
central problem in such cases. For the present purpose we 
may use an index of construction costs, based upon the 
prices of steel, cement, and lumber, and upon wage rates 
in construction industries, compiled by the Engineering 
News Record. This index shows that construction costs 
in 1929 were approximately 107 per cent higher than in 
Table 74 

Actual and Deflated Values of Contracts Awarded in Engineering 
Construction, 1913-1936 

          Contracts awarded,                         Deflated value of 
          engineering construction   Index of        contracts awarded 
Year      (monthly average, in       construction    (monthly average, in 
          thousands of dollars)¹     costs¹          thousands of dollars) 

1913            50,117                1.000                50,117 
1914            48,574                 .886                54,824 
1915            48,740                 .926                52,636 
1916            77,778                1.296                60,014 
1917            61,592                1.812                33,991 
1918            82,729                1.892                43,726 
1919            97,991                1.984                49,391 
1920           126,923                2.513                50,507 
1921            99,459                2.018                49,286 
1922           129,716                1.745                74,336 
1923           158,670                2.141                74,110 
1924           166,593                2.154                77,341 
1925           213,287                2.067               103,187 
1926           237,820                2.080               114,337 
1927           271,147                2.062               131,497 
1928           298,215                2.068               144,205 
1929           329,193                2.070               159,030 
1930           264,438                2.029               130,329 
1931           202,693                1.814               111,738 
1932           101,609                1.570                64,719 
1933            89,031                1.702                52,310 
1934           113,383                1.981                57,235 
1935           132,513                1.952                67,886 
1936           198,904                2.065                96,322 

¹ Data on contracts awarded have been compiled by the Engineering News 
Record; the index of construction costs has been computed by the same agency. 

1913 (the index is 100 for 1913, 207 for 1929). Dividing 
the 1929 aggregate by 2.07, to correct for the change in 
costs, we secure a deflated total of 1,908 millions of dollars. 
This may be taken to measure the aggregate value of 
engineering contracts awarded in 1929, when the 1913 
dollar is used as a standard of value. (In this process the 
value of money is assumed to be held constant with reference 
to the year which is the base of the price or cost 
index used as a deflator.) If the deflating index may be 
accepted as an accurate measure of changing costs, the 
deflated series may be assumed to define changes in the 
actual volume of engineering construction. The effects of 
changing prices and wages will have been eliminated. 
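
The arithmetic of deflation is simple division, and may be sketched as follows. The figures for the five years shown are taken from Table 74; the function name deflate and the use of dictionaries keyed by year are illustrative choices only.

```python
# Monthly averages of contracts awarded, in thousands of dollars (Table 74).
contracts = {1913: 50117, 1920: 126923, 1929: 329193, 1932: 101609, 1936: 198904}

# Index of construction costs, expressed as a ratio to 1913 (Table 74).
cost_index = {1913: 1.000, 1920: 2.513, 1929: 2.070, 1932: 1.570, 1936: 2.065}


def deflate(values, index):
    """Divide each value by the index for the same year."""
    return {year: values[year] / index[year] for year in values}


deflated = deflate(contracts, cost_index)
for year in sorted(deflated):
    print(year, round(deflated[year]))

# The annual aggregate for 1929, in millions of dollars, treated the same way.
deflated_total_1929 = 3950 / 2.07
print(round(deflated_total_1929))
```

The rounded results agree with the last column of Table 74 for the years chosen, and with the deflated total of 1,908 millions of dollars for 1929.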

The general procedure is illustrated in greater detail in 
Table 74. Actual and deflated series are plotted 
in Fig. 61. The degree to which changing monetary values 
distorted the construction series may be readily appreciated 
from the diagram. 

Most value series are affected by price changes, and it is 
generally advisable to correct for this factor before further 
analysis is attempted. Each case presents a new problem, 
for no general deflating index is suitable to all series. The 
index of wholesale prices compiled by the United States 
Bureau of Labor Statistics has been used extensively in 
deflating economic data expressed in dollar values, but this 
index is not at all appropriate in many of the cases in 
which it has been employed. It is absurd, for instance, to 
deflate money wages by an index of wholesale prices. The 
deflating index employed should be a measure of price 
changes as they affect the series being deflated. 

The deflation of a value series is in general a first step 
in the study of that series. The way is then open for further 
analysis by methods explained in the present and succeeding 
chapters. 
CHAPTER VIII 


THE ANALYSIS OF TIME SERIES: MEASUREMENT 
OF SEASONAL AND CYCLICAL FLUCTUATIONS 

The measurement of secular trend is but one of the 
problems connected with the analysis of a series in time. 
Such series, it has been pointed out, are subject to periodic 
fluctuations, seasonal and cyclical in character, and these 
fluctuations are generally more important in their effects 
upon business than is the long-time trend. Our present 
concern is with methods of isolating such periodic variations. 
The series in Table 75, which clearly reflects the seasonal 
and cyclical swings of domestic trade in the United States, 
may be used to illustrate methods of measuring these 
movements. 


Table 75 

Average Weekly Freight Car Loadings in the United States, 
1918-1927 ¹ 

(Unit: 1,000 cars) 

Month        1918    1919    1920    1921    1922    1923    1924    1925    1926    1927 

January       655     728     820     706     696     848     859     891     920     944 
February      753     687     776     685     757     854     908     906     932     956 
March         842     696     848     691     818     916     916     926     960     998 
April         873     721     730     706     716     941     874     932     966     969 
May           897     759     862     760     776     975     895     971   1,018   1,004 
June          918     796     896     762     831   1,012     906     992   1,052   1,021 
July          970     887     901     750     813     985     881     975   1,037     978 
August        962     892     969     810     853   1,042     969   1,073   1,106   1,073 
September     956     960     967     842     925   1,037   1,037   1,074   1,140   1,093 
October       925     967   1,005     932     978   1,070   1,091   1,107   1,184   1,101 
November      819     807     884     764     957     964     976   1,024   1,042     926 
December      719     758     755     681     832     827     869     925     858     814 

Average       857     805     868     757     829     956     932     983   1,018     990 

¹ Data from the Annual Bulletin of the American Railway Association and 
the Survey of Current Business. The published figures have been slightly 
revised, to take account of calendar variations. 
For the present purpose the study of seasonal and cyclical 
variations in freight car loadings is limited to the period 
1918-1927. The disturbances of the ensuing period, combined 
with changes in railroad operating methods and business 
practices, materially modified the behavior of this series. 
The demonstration of statistical procedure will be clearer 
if restricted to the relatively homogeneous period here covered. 

The Measurement of Seasonal Fluctuations: Moving Averages 

Moving averages provide a useful method of defining 
seasonal variations. Since these fluctuations take place 
within a constant period of twelve months, a moving average 
may be used with more confidence than when a cycle of 
varying length is involved. The magnitude of the fluctua- 
tions (the amplitude of the seasonal swings) will not ordinarily 
be constant, hence the line marked out by the moving 
averages will not be completely free of seasonal influences. 
The relation of the actual monthly items to the moving 
averages may be averaged, however, and the indices of 
seasonal variation based upon these averages. 



of averaging. For example, the weekly averages of freight 
car loadings relate to the middle of each month. The 
average of the twelve monthly items for 1918, when centered, 
falls on July 1st. The average of the items from February, 
1918, through January, 1919, centered, falls on August 1st. 
To secure a figure comparable with the July 15th average, 
these two must be averaged. By this process of computing 
a two-month moving average from the twelve-month average, 
comparability with the original figures is secured. 
Table 76 presents averages obtained in this way for the 
period from July, 1918, to June, 1927. 
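
The centering process may be sketched in Python as follows. The monthly figures for 1918 and 1919 are those of Table 75; the function name centered_moving_average is an illustrative choice, and only two years are carried so that the example remains short.

```python
# Average weekly freight car loadings, by months (Table 75), in 1,000 cars.
loadings = [655, 753, 842, 873, 897, 918, 970, 962, 956, 925, 819, 719,   # 1918
            728, 687, 696, 721, 759, 796, 887, 892, 960, 967, 807, 758]   # 1919


def centered_moving_average(series, period=12):
    """Average each pair of adjacent `period`-term moving averages.

    Item k of the result is centred on series[k + period // 2]; with
    monthly data beginning in January, item 0 relates to July.
    """
    simple = [sum(series[i:i + period]) / period
              for i in range(len(series) - period + 1)]
    return [(first + second) / 2 for first, second in zip(simple, simple[1:])]


cma = centered_moving_average(loadings)

july_1918 = round(cma[0], 1)
january_1919 = round(cma[6], 1)
june_1919 = round(cma[11], 1)
print(july_1918, january_1919, june_1919)
```

The three values printed, for July, 1918, January, 1919, and June, 1919, are 860.5, 808.0, and 803.2, in agreement with the corresponding entries of Table 76.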
Table 76 

Moving Averages of Freight Car Loadings, 1918-1927 

(12-month moving average, centered, adjusted by 2-month moving average, 
centered) 

(Unit: 1,000 cars) 

Month    1918     1919     1920     1921     1922     1923     1924     1925     1926      1927 

Jan.              808.0    850.8    809.6    783.7    915.8    935.9    957.3   1,004.8   1,019.1 
Feb.              801.7    854.6    796.7    788.1    930.9    928.5    965.6   1,008.7   1,015.3 
Mar.              798.9    858.1    784.9    793.4    943.4    925.5    971.5   1,012.8   1,012.0 
Apr.              800.8    860.0    776.8    798.8    951.9    926.4    973.7   1,018.8   1,006.5 
May               802.1    864.8    768.8    808.7    956.0    927.8    976.3   1,022.8     998.3 
June              803.2    867.9    760.5    823.0    956.1    930.0    980.7   1,020.7     991.6 
July     860.5    808.7    863.0    757.0    835.7    956.4    933.1    984.2   1,018.9 
Aug.     860.8    816.2    854.5    759.6    846.0    959.1    934.3    986.5   1,020.9 
Sept.    851.9    826.3    844.1    767.9    854.2    961.3    934.7    989.0   1,023.5 
Oct.     839.5    833.0    836.6    773.6    867.6    958.5    937.5    991.8   1,025.2 
Nov.     827.4    837.6    831.3    774.7    885.3    952.4    943.1    995.2   1,024.8 
Dec.     816.6    846.1    821.5    778.2    901.1    944.7    949.8    999.7   1,022.9 

The original data are now expressed as percentages of 
the corresponding moving averages. These percentages 
are given in Table 77. 
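
The computation of these percentages may be sketched briefly. The figures below are the actual loadings for July to December, 1918 (Table 75) and the corresponding moving averages (Table 76); the variable names are illustrative.

```python
# July to December, 1918: actual loadings and centered moving averages.
actual = [970, 962, 956, 925, 819, 719]
moving_average = [860.5, 860.8, 851.9, 839.5, 827.4, 816.6]

# Each actual item expressed as a percentage of its moving average.
percentages = [round(100 * item / average, 1)
               for item, average in zip(actual, moving_average)]
print(percentages)
```

The six percentages obtained are 112.7, 111.8, 112.2, 110.2, 99.0, and 88.0, the entries for 1918 in Table 77.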


Table 77 

Percentage Relation of Actual Freight Car Loadings to 12-Month 
Moving Averages 

Month    1918     1919     1920     1921     1922     1923     1924     1925     1926     1927 

Jan.               90.1     96.4     87.2     88.8     92.6     91.8     93.1     91.6     92.6 
Feb.               85.7     90.8     86.0     96.1     91.7     97.8     93.8     92.4     94.2 
Mar.               87.1     98.8     88.0    103.1     97.1     99.0     95.3     94.8     98.6 
Apr.               90.0     84.9     90.9     89.6     98.9     94.3     95.7     94.8     96.3 
May                94.6     99.7     98.9     96.0    102.0     96.5     99.5     99.5    100.6 
June               99.1    103.2    100.2    101.0    105.8     97.4    101.2    103.1    103.0 
July     112.7    109.7    104.4     99.1     97.3    103.0     94.4     99.1    101.8 
Aug.     111.8    109.3    113.4    106.6    100.8    108.6    103.7    108.8    108.3 
Sept.    112.2    116.2    114.6    109.6    108.3    107.9    110.9    108.6    111.4 
Oct.     110.2    116.1    120.1    120.5    112.7    111.6    116.4    111.6    115.5 
Nov.      99.0     96.3    106.3     98.6    108.1    101.2    103.5    102.9    101.7 
Dec.      88.0     89.6     91.9     87.5     92.3     87.5     91.5     92.5     83.9 


These percentages show some variation from year to 
year in the relation of the figures for a given month to 
the moving average. Thus the January figures, while 
always below the average, vary from 87.2 per cent to 
96.4 per cent of the average. The nine percentages secured 
for each month must be averaged to obtain the index 
desired. Either the arithmetic average or the median may 
be employed for this purpose. The results secured by 
applying the two methods are shown in Table 78. In 
columns (2) and (3) the actual arithmetic means and 
medians are given. The average of the twelve arithmetic 
means happens to be exactly 100, so no further adjustment 
is needed. Usually the average will depart in some degree 
from 100, as it does for the medians. When this is the 
case, the twelve monthly index numbers must be adjusted 
to make their average equal to 100. The items in column 
(4) have been secured from the items in column (3) by 
dividing throughout by 1.00367. 
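
The averaging and adjustment may be sketched as follows. The nine January percentages are those of Table 77, and the twelve unadjusted medians are those of column (3) of Table 78; the October median is taken here as 115.5, the middle item of the nine October percentages in Table 77. The variable names are illustrative.

```python
from statistics import mean, median

# January percentages, 1919 to 1927 (Table 77).
january = [90.1, 96.4, 87.2, 88.8, 92.6, 91.8, 93.1, 91.6, 92.6]
january_mean = mean(january)
january_median = median(january)

# Unadjusted medians for the twelve months, January to December.
medians = [91.8, 92.4, 97.1, 94.3, 99.5, 101.2,
           101.8, 108.6, 110.9, 115.5, 101.7, 89.6]

# Divide throughout by the average of the twelve, expressed as a ratio to 100.
divisor = mean(medians) / 100
adjusted = [item / divisor for item in medians]

print(round(january_mean, 1), january_median)
print(round(divisor, 5))
print([round(item, 1) for item in adjusted])
```

The divisor obtained is 1.00367 to five decimal places, and the adjusted items average 100, as the procedure requires.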

Table 78 

Indices of Seasonal Variation in Freight Car Loadings, Computed 
from Moving Averages 

   (1)           (2)             (3)             (4) 
  Month       Arithmetic       Medians         Medians 
                means        (unadjusted)     (adjusted) 

January          91.6            91.8            91.6 
February         92.1            92.4            92.1 
March            95.8            97.1            96.7 
April            92.8            94.3            94.0 
May              98.6            99.5            99.1 
June            101.6           101.2           100.8 
July            102.4           101.8           101.4 
August          107.9           108.6           108.2 
September       111.1           110.9           110.5 
October         115.0           115.5           115.1 
November        101.7           101.7           101.3 
December         89.4            89.6            89.3 

Average         100.0           100.367         100.0 


The Computation of Index Numbers of Seasonal 
Variation by Averaging Ratios to Trend 

A somewhat similar method of securing seasonal indices, 
which has certain distinctive advantages, involves the 
averaging of ratios to trend.¹ In the application of this 
method, a suitable line of trend, linear or non-linear, is 
fitted to the data, the actual monthly items are expressed 
as percentages of the corresponding trend figures, and then, 
for each month, an average of the percentage ratios of 
the actual to the trend values is secured. This procedure 
is identical with that described in connection with the 
use of moving averages, except that the actual values may 
be expressed as percentages of normal values derived from 
any function employed to represent trend. In the selection 
of an average value for each month, use may be made of 
a multiple frequency table in obtaining an understanding 
of the nature of the actual seasonal movement. With the 
help of such a table the existence of a definite seasonal 
movement may be verified and the type of average to be 
used in securing a typical value for each month may be 
determined. (It would, of course, be equally appropriate 
to use a table of this type in connection with the method 
of moving averages.) We shall apply this method to the 
data employed in the preceding examples. 

A straight line, fitted to annual averages of the data of 
freight car loadings from 1918 to 1927, as given in Table 75, 
is described by the equation 

y = 769.00 + 23.727x 

with origin at July 1, 1917. Normal values for each month 
may be computed readily.² The normal value for the 
month centering at July 1, 1917, is 769.00 (i.e., the constant 
a of the trend equation). Since the increment over a twelve-month 
period is 23.727, the increment from month to 
month is one twelfth of this, or 1.977. Hence the normal 
value for the month centering at January 1, 1918, is 769.00 
+ (6 × 1.977), or 780.862. But the average weekly freight 
car loadings for January, 1918, must be taken to center 
at January 15th. The monthly trend value centering at 
that date is 780.862 + ½(1.977), or 781.850. The trend 
value for February, 1918, is secured by adding to 781.850 
the monthly increment, 1.977. A similar process gives 
the value for each succeeding month. The results, rounded 
off to the nearest whole number, are given in column (2) 
of Table 80. 

¹ The essentials of this method were worked out independently by Helen D. 
Falkner, "The Measurement of Seasonal Variation," Journal of the American 
Statistical Association, June, 1924, 167-179, and Lincoln W. Hall, "Seasonal 
Variation as a Relative of Secular Trend," Journal of the American Statistical 
Association, June, 1924, 166-166. 

² Methods used in the determination of monthly trend values are discussed 
in Chapter VII. 

Fig. 62. Frequency Distributions: Monthly Freight Car Loadings 
Expressed as Relatives of Corresponding Trend Values 

Expressing each of the given values for each month as a percentage 
of the corresponding trend value, we secure ten 
such relative figures (since the data cover ten years). The 
ten January percentages vary from 79.4 to 98.9, the ten 
October percentages from 107.0 to 119.7, etc. The multiple 
frequency table which appears in Fig. 62 is constructed 
by classifying, in the form of a frequency distribution, the 
items for each month. 

The presence of a distinct seasonal variation is dem- 
onstrated by this table. Freight traffic is consistently 
low in the winter months. Activity is somewhat greater 
in the spring, and reaches a peak as a result of harvesting 
and other demands in the late summer and fall. 

The tabular summary facilitates the selection of a type 
of average for the measurement of the seasonal movements. 
The median is likely to be unrepresentative; it is subject 
to material change in value by the addition or withdrawal 
of one or two entries, unless there is a definite concentration 
in the monthly frequency distributions. The arithmetic 
mean of all the items, on the other hand, may be unduly 
affected by exceptional cases. An alternative method is 
provided by the possibility of taking the arithmetic mean 
of the central items for each month. If an inspection of 
the multiple frequency table does not lead to an immediate 
decision as to which is the best type of average to employ 
in a given case, several index numbers may be worked 
out for each month, and a decision reached after a compari- 
son of the results. (Indeed, since the determination of a 
typical value is a separate problem for each month, the 
method of averaging employed might vary from month 
to month.) In the present instance the seasonal variation 
is fairly regular, year after year. No great differences 
would appear in the results secured by averaging varying 
numbers of items. Index numbers based upon averages 
of the four central items for each of the twelve months 
are appropriate in this case. (In general, an average of 
three, four, or five central values is more likely to be stable 
and representative than either the median or an average 
of all the items for each month. The greater the concentra- 
tion in the monthly frequency tables, the smaller the number 
of items upon which the index numbers may be based.) 

The simple averages of the four central items constitute 
the unadjusted index numbers given in Table 79. Correcting 
these so that the average of each group is equal to 100, 
we secure the adjusted index numbers presented in the 
same table. (These averages have been derived, not from 
the frequency distributions shown in Fig. 62, but from 
individual percentages defining the relation of actual to 
trend values.) 

Table 79

Indices of Seasonal Variation in Freight Car Loadings, Based
upon Percentage Ratios of Actual Values to Linear Trend Values

                  Unadjusted          Adjusted
                 index numbers      index numbers
Month          (based upon four   (based upon four
                central items)     central items)

January              92.9               91.6
February             94.8               93.5
March                98.6               97.2
April                94.3               93.0
May                 100.2               98.8
June                102.4              101.0
July                102.8              101.3
August              109.7              108.2
September           112.3              110.7
October             115.6              114.0
November            103.6              102.1
December             89.9               88.6

Average             101.425            100.0
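The adjustment that brings the average of the twelve indices to 100 is a simple proportional scaling. The following sketch, in Python, applies it to the unadjusted figures of Table 79; the variable names are illustrative only, and the scaling rule (multiply each index by 100 and divide by the average of the twelve) is the usual reading of "correcting so that the average is equal to 100."

```python
# Unadjusted index numbers from Table 79 (means of the four central items).
unadjusted = {
    "January": 92.9, "February": 94.8, "March": 98.6, "April": 94.3,
    "May": 100.2, "June": 102.4, "July": 102.8, "August": 109.7,
    "September": 112.3, "October": 115.6, "November": 103.6, "December": 89.9,
}

# Average of the twelve unadjusted indices.
mean_unadjusted = sum(unadjusted.values()) / len(unadjusted)

# Scale every index so that the twelve adjusted values average exactly 100.
adjusted = {month: value * 100.0 / mean_unadjusted
            for month, value in unadjusted.items()}

for month, value in adjusted.items():
    print(f"{month:<10} {value:6.1f}")
```

The computed values agree with the adjusted column of Table 79 to within about a tenth of a point; small differences in the last digit may arise from rounding in the printed figures.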

The index numbers of seasonal variation derived from
ratios to trend accord very closely with those computed 
from moving averages. The widest discrepancy, for the 
month of February, amounts to only 1.4. The consistency 
of the seasonal movement in freight car loadings helps 
to explain this close agreement. In general the two methods 
here exemplified will yield results that are fairly close 
together. Both are well adapted to the measurement of 
seasonal changes in homogeneous series. Simpler methods 
may be used on occasion, and more involved methods may 
be required in dealing with non-homogeneous series where
there is reason to suspect that the pattern of seasonal 
movements has been modified during the period under 
observation. 

Modifications of these general procedures are necessary 
when the pattern of the seasonal movements in a given 
series is altered during the period under observation. Two 
types of shifts in seasonal variation may be distinguished. 
The first includes shifts that are irregular over time, but 
that are related to definable causal factors. Thus the price 
of an agricultural product may follow one seasonal pattern 
in years of high production, and quite a different pattern
in years of low production.* Where this condition prevails
it may be possible to compute two sets of seasonal indices,
each to be applied under appropriate conditions. Methods
already described may be used in the construction of such 
indices. Of this irregular type, also, are alterations in 
the seasonal pattern of an economic series that reflect sharp 
changes in business practice. Shifts in the dates of the 
annual automobile shows in the United States have mater- 
ially altered the seasonal index of automobile sales. 

The second type of seasonal modification is progressive 
in character. The change in pattern is not sudden, nor 
does it reverse itself. Slow alterations over time in trade
practices and consumption habits bring such evolutionary 
or secular changes. The slow displacement of the open 
car by the closed car brought such a progressive modification 
in the seasonal pattern of automobile sales. In the computa- 
tion of seasonal indices under these conditions persistent 
changes over time in the figures for each month may be 
measured separately. Thus, when ratios to trend have 
been obtained, all the January items (such as those plotted 
in Fig. 62) may be plotted chronologically. The progressive 
change in the January relatives from 1920 to 1937, say, 
is then defined by a line of secular trend. The trend value
for January of 1920 is a first approximation to the January
seasonal index for 1920. The figure for February of 1920
is obtained in the same way, and so for each month of
1920. Adjustment of these preliminary values to make
their average equal to 100 gives a set of seasonal indices
for 1920. Seasonal indices for other years are computed
in the same way.

* See F. L. Thompson, Agricultural Prices, New York, McGraw-Hill, 1936.

This method is, of course, more laborious than the pro-
cedure followed when the seasonal pattern remains con-
stant. Before applying the more complicated method the
investigator should assure himself that the shift in pat-
tern is real, and not merely a reflection of accidental varia-
tions.*
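The computation of seasonal indices under a progressively shifting pattern may be sketched as follows. This is a hedged illustration in Python, not the procedure of any particular investigator: the helper names `fit_line` and `seasonal_indices_for_year` are hypothetical, a straight line fitted by least squares is assumed for each month's trend, and the ratios are invented figures that start from a base level in 1920 and drift by a fixed amount each year.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def seasonal_indices_for_year(ratios_by_month, years, target_year):
    """Seasonal indices for one year when the pattern shifts progressively."""
    preliminary = []
    for ratios in ratios_by_month:      # one list of yearly ratios per month
        a, b = fit_line(years, ratios)
        # Trend value of this month's relatives in the target year.
        preliminary.append(a + b * target_year)
    mean_prelim = sum(preliminary) / len(preliminary)
    # Adjust the preliminary values so that their average equals 100.
    return [100.0 * p / mean_prelim for p in preliminary]


# Hypothetical ratios to trend (per cent), January through December.
years = list(range(1920, 1926))
base = [92, 94, 98, 95, 100, 102, 103, 109, 112, 115, 103, 89]
drift = [0.5, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, -0.2, -0.3, -0.3, 0.0, 0.0]
ratios_by_month = [[b0 + d * (y - 1920) for y in years]
                   for b0, d in zip(base, drift)]

indices_1920 = seasonal_indices_for_year(ratios_by_month, years, 1920)
indices_1925 = seasonal_indices_for_year(ratios_by_month, years, 1925)
```

In this invented case the January index rises and the autumn indices fall between 1920 and 1925, while each year's twelve indices still average 100.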

The Measurement of Cyclical Fluctuations

There remains the task of combining the corrections for 
secular trend and seasonal variation in order to secure
measures of cyclical changes in a given series. Major 
interest in most economic studies attaches to these cyclical 
changes, and the measurement of such changes is usually 
the central problem in the analysis of time series. The 
complete elimination of all non-cyclical movements is impos- 
sible, of course. We must content ourselves with measures 
reflecting cyclical changes intermingled in rather uncertain
proportions with accidental fluctuations. 

The procedure may be illustrated with reference to the 
data of freight car loadings in the United States, presented 
in Table 75. For the purposes of the present illustration 
the study will be restricted to the decade 1918-1927. The
severe disturbances that occurred during the business cycle
that ran its course between 1927 and 1933, and in the years
immediately following, greatly complicate the task of disen-
tangling the secular, seasonal, and cyclical elements in the
behavior of this series. Not until a somewhat longer period
has intervened will it be possible to determine the contribu-
tions a changing secular trend and changing seasonal move-
ments may have made to the fluctuations in railway freight
traffic during the decade 1927-1937.

* Tests of sampling errors are discussed in Chapters XIV, XV, and XVIII.
The test of a linear trend in this case would relate to the slope b of the line
fitted to the relatives for a given month.

The literature on the measurement of seasonal fluctuations is extensive.
The references at the close of this chapter contain detailed accounts of various
modifications of the basic procedures discussed above. A rapid, flexible
and accurate graphic method, suitable for use by the student who has grasped
the essentials of the formal procedures, is explained in the article by William A.
Spurr. Spurr’s method utilizes relative (logarithmic) deviations, a procedure
for which there is strong logical justification.

In attempting to separate the results of secular, seasonal, 
cyclical, and random movements in the behavior of time 
series, it is well to establish a series of “expected” values, 
representing results of the operation of regularly acting 
forces. Most regular and predictable of the forces affecting 
time series are those defined as secular and seasonal. The 
equation to the line of secular trend of freight car loadings 
provides a means of estimating annual and monthly values. 
These would be the “expected” values were the forces of 
trend alone in operation. But we know that a seasonal 
movement, regular enough for fairly exact measurement, is 
superimposed upon the trend. The combination of the 
results of these two forces provides a basic series of “expected 
values,” from which deviations due to the play of other 
forces may conveniently be measured. 

A process suitable to this purpose is illustrated in Table 80. 
In col. (2) we have the monthly trend values of freight 
car loadings, and in col. (3) index numbers of seasonal 
variation. The products of the two, constituting the series 
of “expected values,” are given in col. (4). Thus, for Janu- 
ary, 1918, the expected number of freight cars loaded is 
not 782, the trend value, but 782 × .916, the latter figure
being the seasonal index for January. This correction 
gives an “expected” number of 716. Subtracting from the 
actual values in col. (5) the corresponding expected values, 
we obtain the measurements in col. (6). The 655 cars
loaded in January, 1918, fell short by 61 of the “expected”
Table 80

Illustrating the Analysis of a Series in Time
Freight Car Loadings, 1918-1927

Columns:
  (1) Year and month
  (2) Trend value, T
  (3) Seasonal index (as ratio), S
  (4) Trend corrected for seasonal, TS
  (5) Actual value, A
  (6) Deviation of actual value from "trend corrected for seasonal," A - TS
  (7) Percentage deviation of actual value from "trend corrected for
      seasonal," (A - TS)/TS

(1)      (2)     (3)     (4)     (5)     (6)     (7)
           T       S      TS       A  A - TS  (A - TS)/TS

1918
Jan.     782    .916     716     655     -61    -8.5
Feb.     784    .935     733     753     +20    +2.7
Mar.     786    .972     764     842     +78   +10.2
Apr.     788    .930     733     873    +140   +19.1
May      790    .988     781     897    +116   +14.9
June     792   1.010     800     918    +118   +14.8
July     794   1.013     804     970    +166   +20.6
Aug.     796   1.082     861     962    +101   +11.7
Sept.    798   1.107     883     956     +73    +8.3
Oct.     800   1.140     912     925     +13    +1.4
Nov.     802   1.021     819     819       0       0
Dec.     804    .886     712     719      +7   +0.98

1919
Jan.     806    .916     738     728     -10    -1.4
Feb.     808    .935     755     687     -68    -9.0
Mar.     810    .972     787     696     -91   -11.6
Apr.     812    .930     755     721     -34    -4.5
May      813    .988     803     759     -44    -5.5
June     815   1.010     823     796     -27    -3.3
July     817   1.013     828     887     +59    +7.1
Aug.     819   1.082     886     892      +6    +0.7
Sept.    821   1.107     909     960     +51    +5.6
Oct.     823   1.140     938     967     +29    +3.1
Nov.     825   1.021     842     807     -35    -4.2
Dec.     827    .886     733     758     +25    +3.4

1920
Jan.     829    .916     759     820     +61    +8.0
Feb.     831    .935     777     776      -1    -0.1
Mar.     833    .972     810     848     +38    +4.7
Apr.     835    .930     777     730     -47    -6.0
May      837    .988     827     862     +35    +4.2
June     839   1.010     847     896     +49    +5.8
Table 80 (Continued)
Illustrating the Analysis of a Series in Time

(1)      (2)     (3)     (4)     (5)     (6)     (7)
           T       S      TS       A  A - TS  (A - TS)/TS

1920
July     841   1.013     852     901     +49    +5.8
Aug.     843   1.082     912     969     +57    +6.3
Sept.    845   1.107     935     967     +32    +3.4
Oct.     847   1.140     966   1,005     +39    +4.0
Nov.     849   1.021     867     884     +17    +2.0
Dec.     851    .886     754     755      +1    +0.1

1921
Jan.     853    .916     781     706     -75    -9.6
Feb.     855    .935     799     685    -114   -14.3
Mar.     857    .972     833     691    -142   -17.0
Apr.     859    .930     799     706     -93   -11.6
May      861    .988     851     760     -91   -10.7
June     863   1.010     872     762    -110   -12.6
July     865   1.013     876     750    -126   -14.4
Aug.     867   1.082     938     810    -128   -13.6
Sept.    869   1.107     962     842    -120   -12.5
Oct.     871   1.140     993     932     -61    -6.1
Nov.     873   1.021     891     764    -127   -14.3
Dec.     875    .886     775     681     -94   -12.1

1922
Jan.     877    .916     803     696    -107   -13.3
Feb.     879    .935     822     757     -65    -7.9
Mar.     881    .972     856     818     -38    -4.4
Apr.     883    .930     821     716    -105   -12.8
May      885    .988     874     776     -98   -11.2
June     887   1.010     896     831     -65    -7.3
July     889   1.013     901     813     -88    -9.8
Aug.     891   1.082     964     853    -111   -11.5
Sept.    893   1.107     989     925     -64    -6.5
Oct.     895   1.140   1,020     978     -42    -4.1
Nov.     897   1.021     916     957     +41    +4.5
Dec.     899    .886     797     832     +35    +4.4

1923
Jan.     900    .916     824     848     +24    +2.9
Feb.     902    .935     843     854     +11    +1.3
Mar.     904    .972     879     916     +37    +4.2
Apr.     906    .930     843     941     +98   +11.6
May      908    .988     897     975     +78    +8.7
Table 80 (Continued)
Illustrating the Analysis of a Series in Time

(1)      (2)     (3)     (4)     (5)     (6)     (7)
           T       S      TS       A  A - TS  (A - TS)/TS

1923
June     910   1.010     919   1,012     +93   +10.1
July     912   1.013     924     985     +61    +6.6
Aug.     914   1.082     989   1,042     +53    +5.4
Sept.    916   1.107   1,014   1,037     +23    +2.3
Oct.     918   1.140   1,047   1,070     +23    +2.2
Nov.     920   1.021     939     964     +25    +2.7
Dec.     922    .886     817     827     +10    +1.2

1924
Jan.     924    .916     846     859     +13    +1.5
Feb.     926    .935     866     908     +42    +4.8
Mar.     928    .972     902     916     +14    +1.6
Apr.     930    .930     865     874      +9    +1.0
May      932    .988     921     895     -26    -2.8
June     934   1.010     943     906     -37    -3.9
July     936   1.013     948     881     -67    -7.1
Aug.     938   1.082   1,015     969     -46    -4.5
Sept.    940   1.107   1,041   1,037      -4    -0.4
Oct.     942   1.140   1,074   1,091     +17    +1.6
Nov.     944   1.021     964     976     +12    +1.2
Dec.     946    .886     838     869     +31    +3.7

1925
Jan.     948    .916     868     891     +23    +2.6
Feb.     950    .935     888     906     +18    +2.0
Mar.     952    .972     925     926      +1    +0.1
Apr.     954    .930     887     932     +45    +5.1
May      956    .988     945     971     +26    +2.8
June     958   1.010     968     992     +24    +2.5
July     960   1.013     972     975      +3    +0.3
Aug.     962   1.082   1,041   1,073     +32    +3.1
Sept.    964   1.107   1,067   1,074      +7    +0.7
Oct.     966   1.140   1,101   1,107      +6    +0.5
Nov.     968   1.021     988   1,024     +36    +3.6
Dec.     970    .886     859     925     +66    +7.7

1926
Jan.     972    .916     890     920     +30    +3.4
Feb.     974    .935     911     932     +21    +2.3
Mar.     976    .972     949     960     +11    +1.2
Apr.     978    .930     910     966     +56    +6.2
May      980    .988     968   1,018     +50    +5.2
Table 80 (Continued)
Illustrating the Analysis of a Series in Time

(1)      (2)     (3)     (4)     (5)     (6)     (7)
           T       S      TS       A  A - TS  (A - TS)/TS

1926
June     982   1.010     992   1,052     +60    +6.0
July     984   1.013     997   1,037     +40    +4.0
Aug.     986   1.082   1,067   1,106     +39    +3.7
Sept.    987   1.107   1,093   1,140     +47    +4.3
Oct.     989   1.140   1,127   1,184     +57    +5.1
Nov.     991   1.021   1,012   1,042     +30    +3.0
Dec.     993    .886     880     858     -22    -2.5

1927
Jan.     995    .916     911     944     +33    +3.6
Feb.     997    .935     932     956     +24    +2.6
Mar.     999    .972     971     998     +27    +2.8
Apr.   1,001    .930     931     969     +38    +4.1
May    1,003    .988     991   1,004     +13    +1.3
June   1,005   1.010   1,015   1,021      +6    +0.6
July   1,007   1.013   1,020     978     -42    -4.1
Aug.   1,009   1.082   1,092   1,073     -19    -1.7
Sept.  1,011   1.107   1,119   1,093     -26    -2.3
Oct.   1,013   1.140   1,155   1,101     -54    -4.7
Nov.   1,015   1.021   1,036     926    -110   -10.6
Dec.   1,017    .886     901     814     -87    -9.7

number, 716. Such deviations of actual values from “trend 
corrected for seasonal” represent the combined influence 
of cyclical and accidental factors. These may be utilized 
in the absolute form given in col. (6), or may be expressed 
in percentage terms as in col. (7) of Table 80. 
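The arithmetic of cols. (4), (6), and (7) may be retraced with a short Python sketch. The trend values, seasonal indices, and actual values below are those given for 1918 in Table 80; the variable names are illustrative, and the rounding of the expected value to a whole number before the deviation is taken is an assumption that reproduces the printed figures.

```python
# (month, trend value T, seasonal index S as a ratio, actual value A) for 1918.
rows_1918 = [
    ("Jan.", 782, 0.916, 655), ("Feb.", 784, 0.935, 753),
    ("Mar.", 786, 0.972, 842), ("Apr.", 788, 0.930, 873),
    ("May", 790, 0.988, 897), ("June", 792, 1.010, 918),
    ("July", 794, 1.013, 970), ("Aug.", 796, 1.082, 962),
    ("Sept.", 798, 1.107, 956), ("Oct.", 800, 1.140, 925),
    ("Nov.", 802, 1.021, 819), ("Dec.", 804, 0.886, 719),
]

table = []
for month, trend, seasonal, actual in rows_1918:
    expected = round(trend * seasonal)        # col. (4): TS
    deviation = actual - expected             # col. (6): A - TS
    pct = 100.0 * deviation / expected        # col. (7): (A - TS)/TS, per cent
    table.append((month, expected, deviation, pct))

for month, expected, deviation, pct in table:
    print(f"{month:<6}{expected:6d}{deviation:+7d}{pct:+8.1f}")
```

The same loop, applied to the remaining years, yields the rest of the table.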

The series defining trend values corrected for seasonal 
variations, which are given in cols. (6) and (7) of Table 80,
furnish the most satisfactory bases from which cycles in 
economic series may be measured. It is true that the 
“cycles” in cols. (6) and (7) are distorted by accidental 
fluctuations, but there is no simple means by which these 
may be eliminated. Recognizing their presence, the series 
may be put to fruitful use in the study of cyclical movements.¹

¹ A series of “corrected deviations from trend” may be secured by subtract-
ing the indices of seasonal variation from a series in which actual values are
The analysis of this series may be followed through
graphically in Figs. 63 and 64 on page 300. The actual data of
freight car loadings, by months from 1918 to 1927, are plotted
in Fig. 63, together with a straight line of trend. In
addition, a series of expected values (the figures in col. [4]
of Table 80) is given for comparison with the actual. In
this chart the seasonal pattern, shown by the dotted line, is
superimposed upon the trend. Fig. 64 shows the deviations
of actual from expected values, in percentage terms. These
constitute the “cycles” in freight car loadings. As we have
noted, random elements as well as cyclical fluctuations proper
are present in these deviations. It would be possible, by
using three- or five-month moving averages on these devia-
tions, or by other smoothing processes, to eliminate some of
the effects of the accidental movements. But the random and 
the cyclical movements are so closely interwoven that the 
attempt at separation is not generally made. 
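For the reader who does wish to try such smoothing, a centered moving average of the percentage deviations may be sketched as below. The function name `centered_moving_average` is hypothetical; the figures are the twelve percentage deviations for 1921 from col. (7) of Table 80, and the treatment of the end months (simply omitted) is one possible convention among several.

```python
def centered_moving_average(series, window=3):
    """Centered moving average with an odd window; end terms are omitted."""
    if window % 2 == 0 or window < 1:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    return [sum(series[i - half:i + half + 1]) / window
            for i in range(half, len(series) - half)]


# Percentage deviations from "trend corrected for seasonal," 1921 (Table 80).
pct_1921 = [-9.6, -14.3, -17.0, -11.6, -10.7, -12.6,
            -14.4, -13.6, -12.5, -6.1, -14.3, -12.1]

smoothed_3 = centered_moving_average(pct_1921, window=3)
smoothed_5 = centered_moving_average(pct_1921, window=5)
```

The three-month average covers February through November and the five-month average March through October; both damp the month-to-month irregularities, but neither can be said to separate random from cyclical movement.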

If cyclical changes in this series are to be compared with 
similar changes in other series, it is desirable to reduce the fig- 
ures to a form permitting such comparison. The percentage 
deviations might be much more violent in one series than in 
another, and without a common denominator comparison 
would be difficult. This common denominator is afforded by 
the standard deviation. The monthly or annual deviations 
may be expressed in terms of the standard deviation as the 
unit of measurement, if such comparison is to be made. 
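Expression of the deviations in units of the standard deviation may be sketched as follows. The function name `in_sd_units` and the two short series are hypothetical, and the standard deviation is here assumed to be computed about the arithmetic mean of the deviations, treating the months observed as the whole group.

```python
import statistics


def in_sd_units(deviations):
    """Express each deviation as a multiple of the series' standard deviation."""
    sd = statistics.pstdev(deviations)   # standard deviation about the mean
    return [d / sd for d in deviations]


# Two hypothetical series of percentage deviations; the second swings
# exactly twice as violently as the first.
series_a = [-4.0, 2.0, 6.0, -1.0, -3.0]
series_b = [2.0 * d for d in series_a]

units_a = in_sd_units(series_a)
units_b = in_sd_units(series_b)
```

Once reduced to this common unit the two series coincide, so that differences in amplitude no longer obscure a comparison of the timing and pattern of the cyclical movements.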


(Footnote 1 continued from page 298.)

given as percentages of corresponding trend values. That is, $\frac{A}{T} - S$ may be
employed, instead of $\frac{A - TS}{TS}$. This usage, which involves the assumptions
that the “cyclical-accidental” composite and seasonal variations both repre-
sent deviations from trend as base and that their influences are additive, is
not as strong, logically, as the method exemplified in the text. Trend and
seasonal forces are the constant factors in the behavior of time series. In
combination they may be thought of as providing the base from which cycli-
cal and accidental movements occur, as deviations. (This is a convenient, and
perhaps not a faulty, conception. We do not know enough, however, of
the true organic relations among the elements of time series.)
Fig. 64. Cyclical and Accidental Fluctuations in Freight Car Loadings in the United States, 1918-1927
The process of analysis has now been completed. We
have, for the given series, the equation to the line of secular
trend, and from this the normal or trend value at any 
given date may be computed. The seasonal variations have 
been measured, and indices of these variations computed. 
Finally, the cyclical fluctuations (plus the unmeasurable 
random and accidental changes) have been isolated. These 
measurements of cyclical fluctuations may be used in 
studying the sequences of change in different economic 
series during business cycles, in comparing economic series 
in respect of the amplitude or duration of their cyclical 
movements, and in various other ways in the analysis of 
business cycles and the planning of business operations. 
Some of these applications are discussed in later sections. 

General Considerations 

Certain considerations not specifically mentioned above 
should be borne in mind in subjecting time series to the
type of analysis described in this chapter. It is essential
that the data employed be homogeneous, as regards sources, 
methods of quotation, coverage, etc. In addition, homo- 
geneity in the conditions underlying the behavior of the 
particular series which are the objects of study is assumed. 
Homogeneity, as the term is here used, may not be defined 
in absolute terms. New factors are constantly being inter- 
jected into economic and social life. Homogeneity cannot 
be taken to mean static conditions. The changes must
be orderly and, as regards major movements, reasonably 
continuous if the kind of analysis here discussed is to yield 
results. Abrupt dislocations that suddenly alter prevailing 
trends and existing seasonal patterns break the necessary 
homogeneity of statistical series. If the forces that caused 
these dislocations persist, and operate in orderly fashion,
we mark a break in our series and subject the new period 
to analysis in its turn. 

For the determination of a line of trend and the calcula-
tion of indices of seasonal variation, data extending over
as long a period as possible should be employed (subject
to the preceding qualification regarding fundamental dis-
continuities). Ten years may be suggested as a minimum
period, though a much longer term of years is desirable.
If interest attaches to cycles of long duration, rather than 
to the short-period business cycles with which the preceding 
account is concerned, our concept of trend, as well as that 
of cycles, must be modified. The minimum time period
suitable for study must be correspondingly lengthened. 

If a relatively short term of years is employed in the 
determination of trend, it is important that the terminal 
years be neither exceptionally high nor exceptionally low,
as a result of cyclical or accidental movements. In general, 
the cyclical movements in the terminal years should be 
in “symmetrical phases,” in Crum’s phrase. Thus a cyclical 
rise at the beginning of the period should be balanced by 
a cyclical decline at the end. 

It is logically improper to make correction for assumed 
seasonal movements in a time series unless the existence 
of true seasonal variations has been established. That is, 
a test should be applied to determine whether the observed 
departures of the various monthly indices from their aver- 
age value (100) are attributable to the play of chance, or 
whether a true seasonal pattern is present. The basis of 
such tests of significance is discussed in Chapters XIV 
and XVIII, and a method appropriate to the present problem
is developed in Chapter XV. 

In fitting a line of trend, computing indices of seasonal 
variation and deriving, finally, a set of residual figures 
which are taken to reflect the cyclical fluctuations in an 
economic series we are, of course, abstracting from reality. 
As in all such abstractions, caution is necessary. Assump- 
tions implicit in the various steps taken are likely to be 
forgotten. Thus the “cycles” plotted as deviations in Fig. 64
are distorted not only by the random and irregular fluctua-
tions to which attention has already been called. To the
extent that the trend is inadequately or inaccurately defined 
by the particular function used, residual errors are present 
in the deviations. To the extent that seasonal movements 
are inaccurately measured by the seasonal indices employed,
other residual errors are present. And if the trend is pro- 
jected beyond the period covered by the fitting process, 
or if seasonal indices are used for periods not included in 
their calculation, new sources of possible error are intro- 
duced. The “cycles” that appear so definite and clear-cut 
in our tables may contain more than traces of many non- 
cyclical elements. It is often desirable to employ methods 
of analysis that carry us far from the original observations, 
but the dangers of misinterpretation and error are multiplied 
as we abstract from the reality of economic processes and 
business operations. 

The methods of time series analysis described in this 
and the preceding chapter are adapted to a variety of 
economic and business purposes. But they do not constitute 
the only means of attack, in dealing with series ordered 
in time. Special problems may necessitate the use of some- 
what more elaborate procedures. For some purposes simpler
methods will suffice. For other purposes it may be invalid
to attempt to isolate and measure separately the influence 
of secular, seasonal, and cyclical forces. Economic science 
has yet to determine the precise nature of the interrelations 
among these categories of forces. In the light of this fact 
the discerning statistician will adapt his methods to the 
requirements of individual problems, as they arise.
CHAPTER IX


INDEX NUMBERS OF PHYSICAL VOLUME

Comprehensive and accurate records of physical pro- 
duction are of central importance to business interests, to 
government, and to economists. The appraisal of the mar- 
ket and the intelligent planning of production programs 
require knowledge of past production trends and present 
conditions. The credit policies of banking authorities and 
monetary policies of federal agencies are determined in
good part with reference to the physical volume of goods 
being produced and marketed. The phases of business 
cycles are probably traced with more accuracy by produc- 
tion movements than by changes in any other economic 
element. The directions in which the productive efforts 
of an economy are being exerted are defined by records of 
the output of goods of different classes, such as capital 
goods and consumption goods. Changes in the course of
years in the true standard of living of a nation must be 
measured in terms of the aggregate of physical goods pro- 
duced. 

The last twenty years have witnessed notable enlarge- 
ments of the scope and improvements in the accuracy of 
measurements of production in the United States. Efforts 
of federal agencies, private organizations, and trade associa-
tions have combined to provide materially better statistics
of output in agriculture, mining, and manufacture. More
recently records of the volume of trade have been broadened
and made more accurate. There are important gaps still,
particularly as regards the output of finished, highly fabri-
cated goods not easily enumerated in units of constant
quality. But the statistics we have provide a full and
reasonably accurate record of monthly and annual movements
of production.

Here, again, we face the problem of combining series 
relating to individual commodities. For scattered data on 
the output of oats, coal, gasoline, pig iron, automobiles, 
etc. do not define the general changes in output that are 
of interest to persons concerned with the larger aspects 
of economic change. He who would study the course of 
general production encounters a problem much like that
presented to the student of general price movements. If 
the general trend of production is to be determined, or 
if the cyclical or seasonal swings of production are to be 
studied, the mass of individual figures must be reduced to
the form of a single index, the significance of which may 
be easily comprehended. The present chapter deals with 
methods appropriate to the construction of such indices. 

Index Numbers of Production Unadjusted for Trend
and Seasonal Movements

An immediate and obvious obstacle to the combination 
of measures of output for different industries arises from 
differences in the units employed. Since bushels, tons, and 
gallons may not be added directly, the simple aggregative 
type of index is ruled out. One method of overcoming this 
difficulty is to reduce to relative terms the several output 
series that are to be combined. A relative number measuring 
the output of petroleum in 1936 as a percentage of output
in 1922 may be averaged with similar relatives for bituminous
and anthracite coal. The average may be a simple one,
or the relatives for the several commodities may be weighted 
in proportion to the importance of the commodities in 
question. This procedure was illustrated in detail in the 
opening pages of Chapter VI. 

An alternative method is to employ an index of the 
weighted aggregative type, keeping quantities constant as 
between two periods being compared. In 1917, according 
to the computations of the Price Section of the War Indus-
tries Board, the total value of the output of 90 raw materials
in the United States was 34,748 millions of dollars. This
figure represents, of course, a value total of the type
$\Sigma(q_{1917}p_{1917})$, where $q_{1917}$ represents the quantity of a given
raw material produced in 1917 and $p_{1917}$ represents the
average price of that commodity in 1917. In 1918 both
quantities and prices were different. If, however, we obtain
another value aggregate using 1918 quantities and 1917
prices we shall have a figure differing from that for 1917
only in respect of the quantity factor. For the 90 raw mate-
rials in question this total, which is represented by the
expression $\Sigma(q_{1918}p_{1917})$, amounted to $35,169,000,000. The
totals for 1918 and 1917 are comparable, being both in
dollar units. The difference between them measures the
change in physical production between the two years. As
an index of this change we have

$$I = \frac{\Sigma(q_{1918}p_{1917})}{\Sigma(q_{1917}p_{1917})} = \frac{\$35{,}169}{\$34{,}748} = 101.2$$
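The aggregative index with base-year prices as constant weights may be sketched in Python as follows. The function name `aggregative_quantity_index`, the three commodities, and their prices and quantities are hypothetical and serve only to show the mechanics; the final lines apply the same ratio to the two value totals quoted in the text for the 90 raw materials, expressed on a base of 100.

```python
def aggregative_quantity_index(q_base, q_given, p_base):
    """Quantity index: given-year quantities over base-year quantities,
    both valued at base-year prices, on a base of 100."""
    numerator = sum(q_given[c] * p_base[c] for c in p_base)
    denominator = sum(q_base[c] * p_base[c] for c in p_base)
    return 100.0 * numerator / denominator


# Hypothetical commodities in unlike units (tons, bushels, gallons).
p_base = {"coal": 3.0, "oats": 0.5, "gasoline": 0.2}        # base-year prices
q_base = {"coal": 100.0, "oats": 400.0, "gasoline": 1000.0}
q_given = {"coal": 110.0, "oats": 380.0, "gasoline": 1100.0}

index = aggregative_quantity_index(q_base, q_given, p_base)

# The value totals quoted in the text, in millions of dollars.
raw_materials_index = 100.0 * 35169 / 34748
print(round(raw_materials_index, 1))
```

Because both aggregates are valued at the same prices, bushels, tons, and gallons are reduced to a common dollar unit and may be added.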

This index will be recognized as one of the aggregative
types discussed in Chapter VI, except that the p’s and the
q’s are interchanged. When information concerning both
quantities produced and average per unit prices is available,
these aggregative indices, or the “ideal” index which is
a combination of two such aggregative measures, may be
employed for the measurement of quantity changes as well
as for price changes. The “ideal” index, when used for
this purpose, takes the form

$$I = \sqrt{\frac{\Sigma(q_1p_0)}{\Sigma(q_0p_0)} \times \frac{\Sigma(q_1p_1)}{\Sigma(q_0p_1)}}$$

where $q_0$ and $p_0$ represent the quantities and prices of the
individual commodities in the base year, while $q_1$ and $p_1$
represent quantities and prices in the given year. The
procedure in the computation of such an index is identical
with that employed in computing the “ideal” price index,
with prices and quantities reversed. This formula may be
modified, as was the corresponding price index, to

$$\frac{\Sigma(p_0 + p_1)q_1}{\Sigma(p_0 + p_1)q_0}$$

or to a form in which the p’s come from some intermediate
year. In one form or another, the aggregative type of
index is well adapted to the requirements of an index of
physical volume.*

* Since the price or value factor enters in the derivation of such an index,
whether it be constructed from relative numbers or from value aggregates,
no quantity index is completely divorced from pecuniary measurements. For
a discussion of this point, and of other logical problems involved in the con-
struction of index numbers of production, see Arthur F. Burns, “The Measure-
ment of the Physical Volume of Production,” Quarterly Journal of Economics,
February, 1930.

Fig. 65. Changes in the Physical Volume of Manufacturing Production
in the United States, 1914-1935. All Commodities, Capital Goods and
Consumption Goods

The aggregative procedure lends itself readily to the con-
struction of index numbers for commodity groups. This
is desirable in the study of production movements, as it 
is for prices. The significant features of production changes 
over a given period may be far more clearly revealed by 
measurements of relative changes in the output of different 
classes of goods than by a general index of production. 
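The “ideal” quantity index and its modified form, both described earlier in this section, may be sketched in Python as follows. The function names and the three-commodity figures are hypothetical; the first function takes the geometric mean of the two aggregative quantity indices (one weighted by base-year prices, one by given-year prices), and the second weights quantities by the sum of the prices of the two years.

```python
from math import sqrt


def value_sum(q, p):
    """Value aggregate: sum of quantity times price over all commodities."""
    return sum(q[c] * p[c] for c in q)


def ideal_quantity_index(q0, q1, p0, p1):
    """Geometric mean of the base-weighted and given-weighted quantity indices."""
    base_weighted = value_sum(q1, p0) / value_sum(q0, p0)
    given_weighted = value_sum(q1, p1) / value_sum(q0, p1)
    return 100.0 * sqrt(base_weighted * given_weighted)


def modified_quantity_index(q0, q1, p0, p1):
    """Quantities of the two years weighted by the sum of the two years' prices."""
    numerator = sum((p0[c] + p1[c]) * q1[c] for c in q1)
    denominator = sum((p0[c] + p1[c]) * q0[c] for c in q0)
    return 100.0 * numerator / denominator


# Hypothetical prices and quantities for base year (0) and given year (1).
p0 = {"coal": 3.0, "oats": 0.5, "gasoline": 0.2}
p1 = {"coal": 3.6, "oats": 0.4, "gasoline": 0.25}
q0 = {"coal": 100.0, "oats": 400.0, "gasoline": 1000.0}
q1 = {"coal": 110.0, "oats": 380.0, "gasoline": 1100.0}

ideal = ideal_quantity_index(q0, q1, p0, p1)
modified = modified_quantity_index(q0, q1, p0, p1)
```

With figures of this kind the two forms ordinarily differ but slightly, the modified form being somewhat lighter to compute.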

Changes in the volume of production of various classes 
of manufactured goods during the period 1914-1935 are 
indicated by the following measurements, constructed by 
the National Bureau of Economic Research. The basic
data, which were compiled by the Census of Manufactures, 
provide the quantity and (by derivation) the unit price 
records required for the “ideal” formula. That formula, 
slightly modified for working purposes, was employed in 
the construction of these index numbers. 


Table 81

Index Numbers of the Physical Volume of Production of
Manufactured Goods in the United States, 1914-1935 ¹

                                                      Goods        Goods destined
           All                Semi-     Perish-    destined for    for capital
Year    industries  Durable  durable     able         human        equipment and
                     goods    goods      goods     consumption     construction

1914      100.0     100.0    100.0      100.0        100.0            100.0
1919      129.5     141.7    120.9      123.2        129.1            129.5
1921      104.5      99.6    104.6      108.9        109.4             91.8
1923      155.8     183.7    140.2      135.4        150.4            164.3
1925      159.5     185.2    141.8      144.4        154.0            167.7
1927      163.3     177.2    161.0      154.9        159.5            166.7
1929      183.7     210.9    162.5      170.9        177.7            192.0
1931      138.2     112.3    137.4      154.9        146.9            103.7
1933      128.0      91.4    140.1      144.4        142.6             81.3
1935      160.5     143.9    164.4      163.9        171.3            122.5

Selected measurements from Table 81 are shown graphi- 
cally in Fig. 65. 

¹ Constructed by the National Bureau of Economic Research, New York. See
Economic Tendencies in the United States for a statement on procedure.

Adjusted Index Numbers of Production

In the analysis of time series we have seen that cyclical 
fluctuations are often the objects of primary interest. This 
is particularly true in the study of physical volume, for 
changes in the volume of production and trade are features
of fundamental importance in business cycles. Methods
have been explained, in the preceding chapters, by means
of which we seek to measure the cyclical fluctuations in
individual series (fluctuations which are inextricably entan- 
gled with accidental movements of major and minor degree). 
An obvious next step, in the study of general business 
conditions, is the combination of the cyclical-accidental 
movements in a number of series into a single index. The 
utility of such an index of changes in the physical volume 
of production in the course of the business cycle is evident. 

When annual data are employed the construction of an 
index of these cyclical changes is simple. No problem of 
seasonal variation enters, and secular trend alone has to 
be taken account of. Two different methods by which
this may be done present themselves. Edmund E. Day, 
a pioneer in this field of economic research, has tested both 
methods. 

The first involves the fitting of an appropriate line of 
trend to each of the constituent series. The actual items 
are then expressed as percentages of the corresponding 
trend values. When this has been done for each series, 
the final adjusted index for a given year is obtained by 
taking a weighted average of these percentages for that
year. Each commodity may be weighted in this averaging
process, as in the calculation of the unadjusted index. 
The resulting adjusted index is in terms of relatives, but 
these relatives refer to a hypothetical “normal,” instead 
of to any fixed base. This is the desired index of cyclical- 
accidental changes in the physical volume of production. 
With monthly data the process is the same, except that,
before being averaged, the deviations from trend are corrected
to eliminate the influence of recurrent seasonal
movements.
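The first method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trend is taken to be a least-squares straight line (the text asks only for "an appropriate line of trend"), and the series names, output figures, and weights below are hypothetical.

```python
def linear_trend(values):
    """Least-squares straight line fitted to equally spaced observations."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [v_mean + slope * (t - t_mean) for t in range(n)]


def adjusted_index(series, weights):
    """Weighted average, period by period, of each series as a percentage of its trend."""
    percentages = {
        name: [100 * v / tr for v, tr in zip(vals, linear_trend(vals))]
        for name, vals in series.items()
    }
    total_weight = sum(weights[name] for name in series)
    n_periods = len(next(iter(series.values())))
    return [
        sum(weights[name] * percentages[name][i] for name in series) / total_weight
        for i in range(n_periods)
    ]


# Hypothetical annual output figures and weights, for illustration only.
series = {"pig_iron": [10.0, 12.0, 11.0, 15.0, 14.0],
          "cattle": [50.0, 52.0, 55.0, 54.0, 58.0]}
weights = {"pig_iron": 20, "cattle": 4}
index = adjusted_index(series, weights)
print([round(v, 1) for v in index])
```

A series lying exactly on its trend line contributes 100 in every period, which is the sense in which the adjusted relatives refer to a hypothetical "normal" rather than to a fixed base.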

In the process of averaging deviations from trend, account 
should be taken of the relative variability of the series 
being combined. As an example, we may consider the 
combination of data of pig iron production and cattle 
receipts in a general index of production. Reducing pig 
iron production to terms of “seasonably adjusted deviations 
from trend,” we obtain a series marked by rather extreme 
fluctuations. The standard deviation of this adjusted 
series, for a given period, was 27. For cattle receipts, cor- 
respondingly adjusted, the standard deviation was 11. In 
any combination of the two series of percentage deviations 
the more widely fluctuating pig iron measurements will exer- 
cise a dominant influence, unless correction is made. The use 
of weights defining the relative economic importance of the 
two series will not prevent distortion due to the greater 
variability of the pig iron series. 

One way out of the difficulty is to divide the deviations 
from trend by the respective standard deviations, before 
averaging. This gives an index in standard deviation units. 
Another procedure involves the combination of the “economic
weight” and the standard deviation of each series
in a weighting factor to be applied directly to the percentage
deviations from trend. The economic weight is divided
by the corresponding standard deviation, in making the
combination. The method is illustrated below.


                        Economic   Standard    Economic weight
Series                  weight     deviation   ÷ standard deviation

Pig iron production       20          27            .741
Cattle receipts            4          11            .363

The final weighting factors are the figures given in the last
column. These may, of course, be rounded off when a
number of series are to be averaged.*
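The weighting factors can be checked with a short computation. The sketch below divides each economic weight by the corresponding standard deviation, giving roughly 0.74 for pig iron and 0.36 for cattle receipts, and then applies the factors to a pair of percentage deviations from trend. The deviations in the example are hypothetical, and the function name is an illustrative choice.

```python
# Economic weights and standard deviations as given in the text.
economic_weight = {"pig iron production": 20, "cattle receipts": 4}
standard_deviation = {"pig iron production": 27, "cattle receipts": 11}

# Weighting factor = economic weight / standard deviation.
weighting_factor = {
    name: economic_weight[name] / standard_deviation[name]
    for name in economic_weight
}


def combined_deviation(pct_deviations):
    """Weighted average of percentage deviations from trend, using the factors above."""
    total = sum(weighting_factor[name] for name in pct_deviations)
    return sum(weighting_factor[name] * d for name, d in pct_deviations.items()) / total


# Hypothetical deviations from trend for one month (one standard deviation each way).
example = combined_deviation({"pig iron production": 27.0, "cattle receipts": -11.0})
print({k: round(v, 3) for k, v in weighting_factor.items()}, round(example, 2))
```

Dividing by the standard deviation puts both series in comparable units before the economic weights take effect, so the more volatile pig iron series no longer dominates merely because it fluctuates more widely.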


* This useful method of combining economic weights and standard deviations
has been employed by G. W. Starr, Director of the Bureau of Business
Research of Indiana University. I am indebted to him for the example.

The alternative method of combining economic series is
simpler. Each unadjusted index possesses a trend which is
“a composite of the persistent tendencies of the several
original series upon which the unadjusted index is based.”
It is possible to measure this trend, instead of the separate
original trends, and secure the adjusted index directly from
the unadjusted. Day’s results indicate that there is no
loss of accuracy in the use of the simpler method.

AN INDEX OF INDUSTRIAL ACTIVITY

This procedure, with certain modifications, is well exemplified
in an “Index of Industrial Activity in the United
States,” constructed by the Chief Statistician’s Division
of the American Telephone and Telegraph Company.¹ The
elements of this index are monthly data; seasonal corrections
are therefore necessary. When these corrections have been
made a general index measuring long-term growth and
cyclical-accidental fluctuations, in combination, is constructed
by averaging 11 series, with appropriate weights.²
This index is shown for the period 1899-1937, with line
of trend, in Fig. 66. The trend line was fitted by least
squares to data for the period 1899-1930, with the war
years, 1917-1918, omitted.

When each monthly value of the index is expressed as
a percentage deviation from the corresponding trend value,
the measurements presented in Table 82, and graphically
portrayed in Fig. 67, are obtained. The cyclical-accidental

¹ This index has been constructed for the use of the staffs of the Bell System
companies, and is not available for distribution. It is published here by cour- 
tesy of the American Telephone and Telegraph Company. 

² The following series were used for the later years of the period covered:
steel ingot production, pig iron production, automobile passenger car produc- 
tion, building contracts awarded (on a square foot basis), cotton consumption, 
wool consumption, slaughter of cattle and hogs, newsprint consumption, mis- 
cellaneous freight car loadings, electric power consumption, and employment 
in manufacturing industries. Since employment is included, the index goes 
slightly beyond the field of strict physical production. It is intended to be 
an index of industrial activity.

fluctuations in industrial activity, as represented by the
11 series employed, are traced by the movements of this 
index. 

Index of Industrial Production of the Board of
Governors of the Federal Reserve System

A comprehensive monthly index of production in mining 
and manufacturing industries of the United States is con- 
structed by the Division of Research and Statistics of the 
Board of Governors of the Federal Reserve System. This 
index is designed to serve current needs. In the selection 
of its components emphasis has been placed upon the 
promptness with which basic data on the output of industrial 
commodities become available, as well as upon their accuracy 
and representativeness. 

The chief points of general interest relating to this index 
may be briefly noted. 

Coverage. The index is derived from 60 individual series, 
measuring production in some 35 industries. Approximately 
80 per cent of the total industrial production of the United
States is represented directly or indirectly in the index. 

Base period. The base of the published relatives is daily 
average production during the three years 1923, 1924, and 
1925. The final indices appear as relatives on this base, 
both with and without seasonal correction. 

Character of data used. For each commodity production
is computed in terms of average output per working day. 
By this method distortion due to changes from one month 
to the next in the number of Sundays and holidays included 
is avoided. 

Form of index number. The index is of the weighted
aggregative type. Original quantity figures are multiplied
by weighting factors which convert them into common
units (i.e., values, in dollars). In deriving the final index,
the aggregate for a given date is expressed as a percentage
of the base-period aggregate.


Fig. 68. Physical Volume of Industrial Production in the United States,
1919-1937 (1923-1925 average = 100)


Weighting factors. For mineral products the weight for
each commodity is its average per unit value in the base
period. For manufactured products the weight for a given
commodity is the per unit “value added” (i.e., added by
manufacture), modified to the extent that the commodity
in question is taken to represent other manufactured products
not directly included in the index. These “weights”
thus correspond to p’s in the aggregative formula

$$ \frac{\Sigma(q_1 p_0)}{\Sigma(q_0 p_0)} $$

except that for a manufactured product the p is a “price”
for the services of agents of fabrication, with a modification
to allow the given commodity to represent similar products
for which quantity data are not available. The weights
for manufactured goods were drawn from the Census of
Manufactures for 1923. The $p_0$ used to weight the q for
manufactures is thus not strictly a base-period price.¹
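The weighted aggregative form described above can be illustrated with a small sketch. It is not the Board's actual procedure: the commodities, quantities, and per-unit weights are hypothetical, and the function name is chosen for illustration only.

```python
def aggregative_quantity_index(q_given, q_base, weights):
    """100 * sum(q1 * w) / sum(q0 * w), where w is a fixed per-unit value weight."""
    numerator = sum(q_given[name] * weights[name] for name in weights)
    denominator = sum(q_base[name] * weights[name] for name in weights)
    return 100 * numerator / denominator


# Hypothetical daily-average quantities and per-unit weights (dollars per unit).
weights = {"coal": 4.0, "steel": 30.0}
q_base = {"coal": 1000.0, "steel": 200.0}    # base-period daily average output
q_given = {"coal": 1100.0, "steel": 180.0}   # daily average output in the given month

index = aggregative_quantity_index(q_given, q_base, weights)
print(round(index, 1))
```

Because the same weights multiply both the given-period and the base-period quantities, any movement in the index reflects changes in quantities alone.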

Adjustment for seasonal variation. No correction for trend 
is made, but in one form of the index an adjustment is 
made to eliminate the effect of seasonal fluctuations in the 

¹ Weighting factors were modified for the period 1919-1922 by the combination
of weights for 1919 with those for the base period.

Table 83

Index of Industrial Production, Board of Governors of the Federal
Reserve System, 1919-1937

(Adjusted for seasonal variation. 1923-1925 average = 100)

Month          1919  1920  1921  1922  1923  1924  1925  1926  1927  1928

Jan.             82    95    67    73    99   100   105   106   107   107
Feb.             79    95    66    76   100   102   104   105   108   109
March            76    93    64    80   103   100   103   106   110   108
April            78    88    64    77   106    95   102   107   108   108
May              78    90    66    81   106    89   102   106   109   108
June             83    91    65    85   106    85   102   108   107   108
July             87    89    65    85   104    84   103   108   106   109
Aug.             89    89    67    83   103    89   103   110   106   110
Sept.            87    86    68    88   100    94   101   111   104   113
Oct.             86    83    71    93    99    95   104   111   102   115
Nov.             85    76    71    97    98    97   107   110   101   117
Dec.             86    72    70   100    97   101   109   107   102   118

Annual index     83    87    67    85   101    95   104   108   106   111

Month          1929  1930  1931  1932  1933  1934  1935  1936  1937

Jan.            119   106    83    72    65    78    90    97   114
Feb.            118   107    86    69    63    81    89    94   116
March           118   103    87    67    59    84    88    93   118
April           121   104    88    63    66    85    86   101   118
May             122   102    87    60    78    86    85   101   118
June            125    98    83    59    91    83    87   104   114
July            124    93    82    58   100    76    86   108   114
Aug.            121    90    78    60    91    73    88   108   117
Sept.           121    90    76    66    84    71    91   109   111
Oct.            118    88    73    67    76    73    95   110   102
Nov.            110    86    73    65    72    74    96   114    88
Dec.            103    84    74    66    75    86   101   121    84

Annual index    119    96    81    64    76    79    90   105   110


output of individual commodities. Seasonal indices were
computed by averaging the ratios of actual data to twelve-month
moving averages. (See Chapter VIII.) Where there
was evidence of progressive change in the seasonal pattern,
the seasonal adjustments for a given commodity were
modified from year to year. The actual adjustment for
seasonal change is made by dividing the daily average
output of a given commodity in a stated month by the
seasonal index for that month, expressed as a ratio (i.e.,
as 1.10, if the conventional index were 110). The seasonally
adjusted q would thus be reduced if the seasonal index
were above 1.00, raised if the seasonal index were below
1.00. In the construction of the seasonally corrected
index, these adjusted q’s are used in the aggregative formula
previously described.¹
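The division step may be illustrated as follows. The output figures are hypothetical and the function name is an illustrative choice. The sketch shows only the final adjustment, not the computation of the seasonal indices themselves.

```python
def seasonally_adjust(daily_average_output, seasonal_index):
    """Divide output by the seasonal index expressed as a ratio (110 becomes 1.10)."""
    return daily_average_output / (seasonal_index / 100)


# Hypothetical daily-average outputs in a seasonally high and a seasonally low month.
adjusted_high = seasonally_adjust(550.0, 110)   # reduced, since the index is above 100
adjusted_low = seasonally_adjust(450.0, 90)     # raised, since the index is below 100
print(round(adjusted_high, 1), round(adjusted_low, 1))
```

In this made-up case both months come out at about the same adjusted level, which is what one would expect if the whole difference between them were seasonal.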

Monthly values of this index are given in Table 83, for 
the period 1919-1937. The index is shown graphically in 
Fig. 68.

Derived Indices of Production and Productivity

It is possible, where suitable records of value of product
and indices of price changes are available, to derive an
index of production by indirection. In the case of a single
commodity it is obvious that pq ÷ p = q. (Here q represents
the number of physical units produced, p represents
average per unit price, and pq is the aggregate value.)
A similar process is possible in handling statistics relating
to a number of commodities, in combination. Indeed, the
records may be in the form of relatives, or index numbers,
covering a number of months or years. Division of a value
index by a price index relating to the commodities included
in the value index will yield an index measuring changes
in physical output.

This procedure may sometimes be used to obtain meas- 
urements that could not possibly be built up by combining 
a number of individual records. Whether the method is 
applicable in a given instance depends upon the compara- 
bility of the price and value index numbers. The strict 

¹ A detailed description of the constituents of this index and of the procedure
employed in its construction is given in the Federal Reserve Bulletin
for February, 1927. Revisions are noted in the issues of that Bulletin for
March, 1932, Sept., 1933, Nov., 1936, and March, 1937. The index appears
in current issues of the Federal Reserve Bulletin.

requirement that the price index relate to precisely the
commodities included in the value index cannot generally 
be met. If we assume that a given price index is fairly 
representative of the commodities covered by the value 
records, and if the formula employed in the construction 
of the index is appropriate, the method may be justified 
as a means of approximating the required index of physical 
output. 

An example of such a procedure is furnished by the
materials in Table 84. These illustrate a method used in
deriving an index of production of manufactured goods.
The indices in col. (3) are derived directly from the aggregate
figures on “value added by manufacture.” The indices
in col. (4) measure changes in average “value added” per
unit, or cost of fabrication per unit, of manufactured goods.
(This is, in effect, a price index, the price covering the
services of manufacturing agents in the process of fabrication.)
This series of index numbers is based upon records
available for a representative sample of manufacturing
industries. The general index of manufacturing production,
relating to all industries, is derived by dividing the relatives
for total “value added” by the index numbers measuring
changes in “value added” per unit of product (with a
suitable shift in the decimal point).

Table 84

Illustrating the Derivation of Index Numbers of the Physical Volume
of Manufacturing Production, 1923-1929 *

(1)     (2)             (3)              (4)                  (5)
        Total value added,               Index of value       Derived index of
        all census industries            added per unit       physical volume
Year    (in millions    (in relatives)   of product,          of production
        of dollars)                      industries           (3) ÷ (4)
                                         included in sample

1923    25,850          100.0            100.0                100.0
1925    26,778          103.6             97.3                106.4
1927    27,585          106.7             92.4                115.4
1929    31,844          123.2             96.8                127.3

* This table is taken from Economic Tendencies in the United States, New
York, National Bureau of Economic Research, 308.
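The division of col. (3) by col. (4) can be reproduced directly. The sketch below uses the relatives of Table 84, with the second year read as the census year 1925 (an assumption where the scan is unclear). Since the published relatives are themselves rounded, the quotients agree with col. (5) only to within about a tenth of a point.

```python
# Relatives from cols. (3) and (4) of Table 84 (1923 = 100).
value_added_relative = {1923: 100.0, 1925: 103.6, 1927: 106.7, 1929: 123.2}
unit_value_added_index = {1923: 100.0, 1925: 97.3, 1927: 92.4, 1929: 96.8}

# Derived index of physical volume: value index divided by "price" index, times 100.
derived_volume_index = {
    year: 100 * value_added_relative[year] / unit_value_added_index[year]
    for year in value_added_relative
}

for year, value in derived_volume_index.items():
    print(year, round(value, 1))
```

The multiplication by 100 is the "suitable shift in the decimal point" mentioned in the text.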

The derived measurements given in col. (5) of Table 84 
are probably more accurate than index numbers based upon 
directly enumerated physical products. For the gaps in 
the coverage of the latter are serious. Limitations of coverage 
are the more serious in that the excluded industries are 
in many cases just the new, rapidly developing industries 
the output of which is growing most rapidly. 

A somewhat similar process of derivation is employed in
the construction of measurements of industrial productivity.
It is impossible, by direct observation, to compile records
of output per man or per man-hour over any considerable
area of industrial activity. However, given accurate indices
of physical production and comparable records of number
of workers employed or of man-hours worked, one may
derive index numbers measuring changes in productivity.

An example of this procedure is given in Table 85. The
measurements given should be regarded as approximations
only.


Table 85

Index Numbers of Physical Volume of Production, Man-Hours
Worked and Output per Man-Hour, Manufacturing
Industries of the United States, 1929-1935

        Physical volume     Total number     Estimated
Year    of manufacturing    of man-hours     output per
        production          worked           man-hour

1929         100                100              100
1931          75                 66              114
1933          70                 60              117
1935          87                 70              124


Between 1929 and 1935 the total volume of manufacturing
production declined 13 per cent. The number of man-hours
worked decreased by 30 per cent, however. The
indicated gain in output per man-hour was 24 per cent.
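The third column of Table 85 follows from the first two by simple division, as the short sketch below shows. Rounding to the nearest whole number is an assumption about how the published figures were prepared.

```python
# Index numbers from Table 85 (1929 = 100).
production = {1929: 100, 1931: 75, 1933: 70, 1935: 87}
man_hours = {1929: 100, 1931: 66, 1933: 60, 1935: 70}

# Output per man-hour = production index / man-hours index, as a relative on 1929.
output_per_man_hour = {
    year: round(100 * production[year] / man_hours[year]) for year in production
}
print(output_per_man_hour)
```

For 1935, for example, 87 ÷ 70 gives about 1.24, the 24 per cent gain cited in the text.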

Measurements such as these are of unquestioned value
to the student of industrial change, but their limitations
should be clearly stated. The initial necessity of full comparability
between the output and employment records has
been mentioned. Discrepancies here may lead to serious
errors in the derived measurements. More difficult to
detect are technical industrial changes that do not appear
in the statistical records. Changes in the quality of the
goods represented in the production index may lessen the
accuracy of that index, and affect the productivity measurements.
If employment is measured in terms of number of
men employed, the resulting index of per capita output
may be seriously distorted by changes in the length of the
working week. Again, if only direct labor is enumerated
in the employment index, a shift in technical methods that
involves the use of a much larger proportion of indirect
labor may lead to a great advance in apparent productivity,
which far exceeds the real gain. Some of the gain that
apparently follows the increased mechanization of a plant
or a process is of this fictitious sort. Labor that precedes
the direct act of production, and servicing and supervising
labor, may have replaced direct labor. Failure to take
account of the contributions of these indirect applications
of labor may lead to grossly exaggerated measures of
productivity gains.

The purpose of the preceding pages has been to exemplify 
procedures used in the measurement of changes in produc- 
tion, with incidental reference to related problems. While 
there is no one standard method, it will be clear that the 
construction of quantity index numbers requires no involved 
procedure. Certain special problems (of weighting, of
measuring secular and seasonal movements, of ensuring
comparability when methods of derivation are employed)
have been noted. In addition, most of the problems that
bulk large in the construction of price index numbers are
faced in this area also. The task of obtaining accurate,
homogeneous series of basic data entails no less careful
field work in production than in prices. Quality changes
lessen the accuracy of both types of index numbers. Comparisons
over considerable time periods are rendered inaccurate
by such quality changes and perhaps even more by
changes in “regimen,” that is, in the complex of tastes, consuming
habits, and technical methods that determines the weighting
factors used in the construction of index numbers. In spite
of these difficulties substantial progress has been made in 
recent years in the improvement of measures of industrial 
activity of the type discussed in this chapter. More com- 
prehensive and more accurate data are being compiled, 
and technical standards in the construction of index numbers 
are being steadily raised. These gains are contributing to 
a notable advance in our knowledge of economic processes.


CHAPTER X


THE MEASUREMENT OF RELATIONSHIP: LINEAR 
CORRELATION 

In discussing averages and measures of dispersion and 
skewness we have been dealing with methods of describing 
a single frequency distribution. The arrangement of the
values of a single variable along a scale may be portrayed
by means of these measures, which enable the central value
to be determined and the character of the distribution
about that central value to be described. In the analysis
of time series a somewhat different problem has been faced. 
In such cases we are concerned with the changing values 
of a variable factor with the passage of time, and seek to 
determine the degree to which the changes in value are 
due to the play of different forces — the secular trend and 
cyclical, seasonal, and accidental factors. The preceding 
chapters dealt with methods by which we might measure 
the effect upon a given series of each of these factors (with 
the exception of accidental fluctuations). 

Certain of these methods are applicable to the problem 
now before us. It was found that in dealing with time
series the relationship between time and the long term trend 
factor may be described by a definite mathematical equa- 
tion. That is, trend or growth seems to be a function of 
time for many economic series. Where such a relationship 
prevails, whether it hold precisely or only approximately, 
there is a distinct advantage in securing a mathematical 
expression which describes it. A similar but much broader 
problem is now to be discussed. If it is possible in dealing 
with time series to secure a definite mathematical equation 
for the relation between time and the normal values of the
items in a given series, cannot the same device be employed
in studying the relationship between other variables? Can
we not define, mathematically, the relation between cotton
production and the price of cotton, between corn yield
and rainfall, between earnings and the output of labor?
If this can be done, it will place in the hands of the economist
a very powerful tool, giving his methods something
of the precision which attaches to the work of the physical
scientist.

The Relation between Number of Taxable Personal
Incomes and Motor Vehicle Registration

As a typical problem we may consider the relation between 
the number of taxable personal incomes and the number
of passenger automobiles registered, by states in 1934.
The figures are given in columns (2) and (3) of Table 86.*

These figures are plotted in Fig. 69, each dot representing 
the relation between the number of taxable incomes and 
the number of registered passenger cars for a given state. 
Such a figure is termed a “scatter diagram.” It is clear 
from this diagram that there is a relationship between the 
two variables. In general, the states with a large number 
of taxable personal incomes are also those having a large 
number of motor vehicle registrations. The relationship, 
however, is not perfect. Two states with the same number 
of taxable incomes may differ quite widely in the number 
of registered vehicles. Thus both Rhode Island and Colorado
had 31,000 taxable personal incomes in 1934, yet the former
had 124,000 passenger cars registered, while the latter had
246,000. Were the relationship perfect a single and unchanging
value of the Y-variable would always be found paired
with a given value of the X-variable.

* Nine states for each of which there were more than 100,000 individual
income tax returns and more than 685,000 passenger cars registered in 1934
have not been included. The observations for these states, some of which are
materially affected by the presence of important industrial centers, depart
rather widely from those for the remaining states, and are marked by a functional
relationship between personal incomes and motor vehicle ownership
somewhat different from that prevailing for the country at large. The states
thus excluded are New York, Pennsylvania, New Jersey, Illinois, Massachusetts,
Michigan, Texas, Ohio, and California. The state of Washington has
also been excluded, since the income tax returns for that state are combined
with those of Alaska, in the reports of the Bureau of Internal Revenue. The
results are to be interpreted, of course, with these restrictions in mind.


Table 86

Taxable Personal Incomes and Passenger Automobile Registration in
Thirty-Eight States, 1934

(1)               (2)              (3)               (4)         (5)        (6)
                  No. of taxable   No. of passenger
                  personal         cars registered,
State             incomes, 1934    1934
                  (thousands)      (thousands)
                  X                Y                 XY          X²         Y²

Alabama              23              192              4,416        529       36,864
Arizona              11               80                880        121        6,400
Arkansas             13              162              2,106        169       26,244
Colorado             31              246              7,626        961       60,516
Connecticut          91              310             28,210      8,281       96,100
Delaware             11               45                495        121        2,025
Florida              33              280              9,240      1,089       78,400
Georgia              38              317             12,046      1,444      100,489
Idaho                 9               91                819         81        8,281
Indiana              70              680             47,600      4,900      462,400
Iowa                 48              591             28,368      2,304      349,281
Kansas               36              453             16,308      1,296      205,209
Kentucky             35              295             10,325      1,225       87,025
Louisiana            37              199              7,363      1,369       39,601
Maine                21              141              2,961        441       19,881
Maryland             84              288             24,192      7,056       82,944
Minnesota            67              594             39,798      4,489      352,836
Mississippi          13              141              1,833        169       19,881
Missouri             98              632             61,936      9,604      399,424
Montana              17               97              1,649        289        9,409
Nebraska             27              350              9,450        729      122,500
Nevada                5               26                130         25          676
New Hampshire        17               91              1,547        289        8,281
New Mexico            8               67                536         64        4,489
N. Carolina          32              385             12,320      1,024      148,225
N. Dakota            10              130              1,300        100       16,900
Oklahoma             39              403             15,717      1,521      162,409
Oregon               27              233              6,291        729       54,289
Rhode Island         31              124              3,844        961       15,376
S. Carolina          15              182              2,730        225       33,124
S. Dakota             8              146              1,168         64       21,316
Tennessee            38              299             11,362      1,444       89,401
Utah                 11               85                935        121        7,225
Vermont              10               69                690        100        4,761
Virginia             48              317             15,216      2,304      100,489
W. Virginia          30              167              5,010        900       27,889
Wisconsin            93              589             54,777      8,649      346,921
Wyoming               7               52                364         49        2,704

Totals            1,242            9,549            451,558     65,236    3,610,185

Our first problem is the derivation of an equation to
describe this relationship which, while not perfect, is clearly 



Fig. 69. Scatter Diagram Showing the Relation between Taxable
Personal Incomes and Passenger Car Registration, by States, in 1934,
with Line of Average Relationship


existent. There is here a relationship analogous to a trend, 
and it is apparently a trend which can be represented by 
a straight line. The equation to a straight line, fitted by
the method of least squares to the points on the scatter 
diagram, will express mathematically the average relationship 
between these two variables. Such a line could, of course, 
be fitted by inspection, but a more accurate result will be 
obtained by the method of least squares. 

This calls for the solution of the following normal equations:

$$ \Sigma(Y) = Na + b\Sigma(X) $$

$$ \Sigma(XY) = a\Sigma(X) + b\Sigma(X^2). $$

The values required for the solution of these equations may
be derived from the data as arranged in Table 86. Substituting,
we have

$$ 9{,}549 = 38a + 1{,}242b $$

$$ 451{,}558 = 1{,}242a + 65{,}236b. $$

Solving,

$$ a = 66.321, \qquad b = 5.659. $$

The required equation is

$$ Y = 66.321 + 5.659X. $$ ¹

This line is plotted in Fig. 69.
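The solution of the two normal equations can be verified with a few lines of code. The sketch below eliminates a between the equations to obtain b, then substitutes back. The sums are the column totals of Table 86 for the 38 states, and the function name is an illustrative choice.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_x2):
    """Solve  sum_y = n*a + b*sum_x  and  sum_xy = a*sum_x + b*sum_x2  for a and b."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Column totals of Table 86 (38 states; X and Y in thousands).
a, b = solve_normal_equations(n=38, sum_x=1242, sum_y=9549,
                              sum_xy=451558, sum_x2=65236)
print(round(a, 3), round(b, 3))
```

The slope says that, on the average for these states, each additional thousand taxable incomes went with about 5.7 thousand more registered passenger cars.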

A mathematical expression has now been secured for
the relation between the two variables being studied, the
number of taxable personal incomes, by states, and the
number of passenger automobiles registered. The former
is the independent or X-variable in the equation, the latter
the dependent or Y-variable. This equation constitutes a
measure of the functional relationship between these two
variables, but it is only an expression of average relationship.
How significant is the equation? If the relationship were
perfect, and the plotted points all lay on the line describing
this relationship, the equation could be used with confidence 
as an accurate instrument for determining the value of one 
variable from a value of the other. But a line with a definite 
equation may be fitted to points which depart very widely 
from it, which are widely dispersed. In such a case the 
equation may have the appearance of describing a precise 
relationship but the variation is so great that it cannot be 
used with confidence. It is the same problem as that which 
arises when an average is employed. We must know how 
significant the average is, how great the concentration about 
it, before we may use it intelligently. So the equation of
relationship between variables means little unless we know
to what extent it holds in practical experience. We must
have a measure of the dispersion about the line we have
fitted.

¹ In the chapters on correlation capital letters (X, Y, etc.) are used to
represent original values of the variable quantities, as measured from the zero
points on the scales of actual values. Capital letters with prime marks are
used to measure deviations from arbitrary origins, X' and Y' for such deviations
in class-interval units, X'' and Y'' for such deviations in original units
of measurement. Small letters (x, y, etc.) are used to represent values of the
variables expressed as deviations from their respective arithmetic means.

In describing the frequency distribution, the standard 
deviation is used as the best general measure of variation. 
It is, obviously, the measure we need in determining the
reliability of the equation of average relationship. The 
standard deviation about this line will not only serve as a 
general index of the significance of this equation but will 
enable us to measure the degree of accuracy of estimates 
based upon the equation. 


THE COMPUTATION OF THE STANDARD ERROR
OF ESTIMATE

The standard deviation about a line of average relation- 
ship, being a measure of the accuracy of estimates, may 
be termed the standard error of estimate. The term standard 
deviation is generally confined to the root-mean-square 
deviation about the arithmetic mean. The standard error 
of estimate is represented by the symbol S. 

In the computation of S we must know the computed value 
of Y which corresponds to each given value of X. By 
substituting the given values of X in the equation 

Y = 66.321 + 5.659X

normal Y values may be computed. The deviations of the 
actual Y values from the computed may then be determined. 
The root-mean-square of these deviations is the required 
measure. A method of computation is illustrated in Table 87. 
From this table we have

$$ S_y = \sqrt{\frac{\Sigma(d^2)}{N}} = \sqrt{\frac{421{,}250.91}{38}} $$

= 105.3 (thousand) motor cars.

(The symbol Sy is used, as this is the standard error of the
Y-variable.)
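
The root-mean-square computation just described may be sketched in Python. This is a minimal illustration only: the function name `standard_error_of_estimate` is hypothetical, and the only figures taken from the text are the total of the squared deviations (421,250.91) and N = 38 given with Table 87.

```python
import math

def standard_error_of_estimate(actual, computed):
    """Root-mean-square deviation of actual Y values from computed Y values."""
    squared = [(y - yc) ** 2 for y, yc in zip(actual, computed)]
    return math.sqrt(sum(squared) / len(squared))

# Check against the totals quoted in the text: sum of d^2 and N.
sum_d2, n = 421_250.91, 38
s_y = math.sqrt(sum_d2 / n)
print(round(s_y, 1))
```

With all thirty-eight pairs of actual and computed values supplied, the function should give the same figure as the calculation from the totals.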



Table 87

Computation of Standard Error of Estimate

Column (2), Y-actual: No. of passenger cars registered, 1934 (in thousands)

(1)                  (2)         (3)           (4)           (5)
State                Y-actual    Y-computed    d             d^2
                                               (2) - (3)

Alabama              192         196.5         -   4.5           20.25
Arizona               80         128.6         -  48.6        2,361.96
Arkansas             162         139.9         +  22.1          488.41
Colorado             246         241.8         +   4.2           17.64
Connecticut          310         581.3         - 271.3       73,603.69
Delaware              45         128.6         -  83.6        6,988.96
Florida              280         253.1         +  26.9          723.61
Georgia              317         281.4         +  35.6        1,267.36
Idaho                 91         117.3         -  26.3          691.69
[name missing]       680         462.4         + 217.6       47,349.76
Iowa                 591         337.9         + 253.1       64,059.61
Kansas               453         270.0         + 183.0       33,489.00
Kentucky             295         264.4         +  30.6          936.36
Louisiana            199         275.7         -  76.3        5,821.69
Maine                141         185.2         -  44.2        1,953.64
Maryland             288         541.7         - 253.7       64,262.25
Minnesota            594         445.5         + 148.5       22,052.25
Mississippi          141         139.9         +   1.1            1.21
Missouri             632         620.9         +  11.1          123.21
Montana               97         162.5         -  65.5        4,290.25
Nebraska             350         219.1         + 130.9       17,134.81
Nevada                26          94.6         -  68.6        4,705.96
New Hampshire         91         162.5         -  71.5        5,112.25
New Mexico            67         111.6         -  44.6        1,989.16
N. Carolina          385         247.4         + 137.6       18,933.76
N. Dakota            130         122.9         +   7.1           50.41
Oklahoma             403         287.0         + 116.0       13,456.00
Oregon               233         219.1         +  13.9          193.21
Rhode Island         124         241.8         - 117.8       13,876.84
S. Carolina          182         151.2         +  30.8          948.64
S. Dakota            146         111.6         +  34.4        1,183.36
Tennessee            299         281.4         +  17.6          309.76
Utah                  85         128.6         -  42.6        1,814.76
Vermont               69         122.9         -  53.9        2,905.21
Virginia             317         338.0         -  21.0          441.00
W. Virginia          167         236.1         -  69.1        4,774.81
Wisconsin            589         592.6         -   3.6           12.96
Wyoming               52         105.9         -  53.9        2,905.21

Total                                                       421,250.91

The measure Sy is to be interpreted in precisely the same
way as the standard deviation about an arithmetic mean.
Given an approximately normal distribution of items about
the line of relationship, 68 per cent of all the cases will lie
within a range of ±S (in this case 105.3), 95 per cent
will fall within ±2S (in this case 210.6) and 99.7 per cent will
fall within ±3S (in this case 315.9). If there were no
scatter about the line fitted to the points representing the
corresponding values of X and Y, S would have a value
of zero, and the value of Y could be estimated from the
value of X with perfect accuracy. The less the dispersion
about the line, the smaller the value of S. The value of
S serves, therefore, as an indicator of the significance and
usefulness of the line which describes the relationship
between the two variables. The standard error, it should
be noted, is expressed in the same units as the original
Y-values.
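
The percentages attached to ranges of one, two, and three times S may be checked with a short Python sketch. It assumes an exactly normal distribution of the deviations, which the text treats only as an approximation; the function name `normal_coverage` is hypothetical.

```python
import math

def normal_coverage(k):
    """Share of a normal distribution lying within k standard deviations of its centre."""
    return math.erf(k / math.sqrt(2))

s_y = 105.3  # standard error of estimate from the text, in thousands of cars
for k in (1, 2, 3):
    # width of the range on each side of the line, and the percentage of cases inside it
    print(k, round(k * s_y, 1), round(100 * normal_coverage(k), 1))
```

The printed percentages should fall close to the 68, 95, and 99.7 per cent quoted above.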


THE MAKING OF ESTIMATES

We may, for a moment, consider the significance of these 
results. Let us assume that, not knowing the number of 
motor vehicles registered in a given state, we are under the 
necessity of estimating it. Two methods are open to us. 
We may, in the first place, base the estimate upon our
knowledge of the Y-variable alone. The total number of
passenger automobiles in the 38 states included in the
study is 9,549,000. Dividing this by 38 we have 251,289
as the average. With no specific information as to the 
registration in a given state, the arithmetic mean of all 
the state figures would be taken as the most probable value 
for the state in question. (The most probable value of a 
series of observations is the mean of the series.) How may 
we judge of the accuracy of this estimate? The standard 
deviation of the original distribution is a measure of the 
degree of variation about the mean and, therefore, a measure 
of the accuracy of an estimate based upon the mean. If
the distribution approximates the normal type, the chances
are 68 out of 100 that the true value for the state in question 
will not differ from the mean by more than the standard 
deviation. The standard deviation of passenger automobile 
registration by states, as recorded in Table 86, is 178.5. 
The mean affords, therefore, a basis for a reasonable estimate, 
and the standard deviation affords some indication of the 
probabilities involved in making this estimate. 

Another method of estimating the motor vehicle registration
in a given state is open to us if we know the number
of taxable personal incomes in that state. We know, as a 
result of the study described in the preceding pages, that 
the average relationship between passenger car registration 
and number of taxable personal incomes is described by the 
equation 

Y = 66.321 + 5.659X.

(The unit is 1,000 for each variable, it will be recalled.) 

If a state has 50,000 taxable personal incomes, it may be
estimated from this equation that there are 349,271 passenger
automobiles registered in that state. This is the most probable
value of Y as determined from the equation of average
relationship. Is this estimate any better than the previous
one, which took the mean Y as the most probable value?
Does our knowledge of the average relationship between
X and Y aid us in estimating the value of Y from a known
value of X?

The answers to these questions are given by the standard
error of estimate, and by the relationship between the
standard error of estimate and the standard deviation of Y.
The standard error of estimate (that is, the standard deviation
about the line of average relationship) is 105.3. The
standard deviation of Y is 178.5. Clearly the estimate
made from the equation is more accurate than the estimate
based upon the value of the mean Y. In the former case
the odds are 68 out of 100 that the error will not exceed
105.3 or, in terms of the original units, 105,300 vehicles.
When the estimate is made from the mean, the odds are
68 out of 100 that the error will not exceed 178,500 vehicles.^1
From our knowledge of the relationship between the two
variables, even though that relationship is by no means
constant or perfect, we are able to reduce materially the
errors of estimate.
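
The comparison of the two methods of estimate may be sketched in Python. The function name `estimate_registrations` is hypothetical; the constants of the equation, the standard error (105.3), and the standard deviation (178.5) are the figures given in the text, all in thousands.

```python
def estimate_registrations(incomes_thousands, a=66.321, b=5.659):
    """Estimate passenger-car registrations (thousands) from taxable incomes (thousands)."""
    return a + b * incomes_thousands

estimate = estimate_registrations(50)  # a state with 50,000 taxable personal incomes
s_y, sigma_y = 105.3, 178.5            # error about the line vs. error about the mean
print(round(estimate, 3), round(s_y / sigma_y, 2))
```

The ratio of the two measures of error is less than one, which is the sense in which the estimate from the equation is the more accurate.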

THE COEFFICIENT OF CORRELATION

We have now secured two measures which aid us in 
describing the relationship between variable quantities. 
The first is the fundamental equation of relationship, the 
expression of the degree of change in one variable associated, 
on the average, with a given change in the other. The second 
is the standard error of estimate, the measure of the degree 
of “scatter” about the line of average relationship. The 
standard error resembles the standard deviation in that 
it is a measure expressed in absolute terms, in the units 
employed in measuring the original Y-values. This measure
enables us to determine in a given case the probability that 
an estimate based upon the equation of relationship will 
fall within certain limits. 

In measuring variation it has been found that an abstract 
measure of variability is needed, one which is divorced 
from the absolute terms of the given problem. Such a 
measure is particularly needed, it was noted, when different 
distributions are to be compared. So, for measuring the 
degree of variability, a coefficient of variation is employed. 
There is need of a somewhat similar measure in connection 
with our present problem. We need a measure of the
degree of relationship between two variables, an abstract
coefficient which is divorced from the particular units
employed in a given case. Karl Pearson has developed such
a coefficient.

1 In the present case, with a limited number of items and distributions which
depart somewhat from the normal type, the precise probabilities cannot be so
accurately determined from the values of Sy and σy. With this qualification
in the matter of interpretation we may use Sy and σy as useful measures of
dispersion.

This measure may be explained in terms of the preceding
discussion. It was found that the usefulness of estimates
based upon the equation of relationship could be determined
by comparing the standard error of Y (the measure of
scatter about the line of relationship) with the standard
deviation of Y. If the standard error be as great as the
standard deviation the equation of relationship is of no
use to us, but if the standard error be less than the standard
deviation the accuracy of estimates may be improved by
using this equation. The significance of the equation is thus
indicated by the relation between the standard error and
the standard deviation. But these are both in absolute
terms, so that by dividing one by the other an abstract
measure may be secured. Thus we might write

$$ \text{Measure of correlation} = \frac{S_y}{\sigma_y} $$

A somewhat more useful measure is secured by putting the
ratio in this form:

$$ \text{Measure of correlation} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}} $$

This measure, when used in connection with a linear equation,
is called the coefficient of correlation and is represented
by the symbol r.

A brief consideration of this formula will help to make
clear the significance of r. If there is no dispersion about
the line of relationship, Sy will have a value of zero; the
equation describes a perfect relationship between the two
variables. In this case, as is clear from the formula, r must
have a value of 1.

The maximum value of Sy is one which is equal to σy.
Under these conditions, when the equation of relationship
is of no aid in improving our estimates, the formula will
give zero as the value of r. Such a value indicates that
there is no relationship between the two variables; in other
words, that the straight line of best fit is horizontal, passing
through the mean of the Y's. It shows that there is no
tendency for the high values of Y to be associated with
high values of X or for high values of Y to be associated
with low values of X. The two variables fluctuate in absolute
independence. In such a case the deviation of each point
from the fitted line is equal to its deviation from the mean,
and the two root-mean-square deviations are equal, as
stated.

Zero and unity are thus the limits to the value of r.
The values found in practical work fall somewhere between
these limits, approaching unity in cases where the degree
of relationship is high. The greater the value of r, the
greater the confidence that may be placed in the equation
as an expression of a relation which is approximated in a
high percentage of cases. In the example presented above,
dealing with motor vehicle registration and number of
taxable personal incomes, we have

$$ r = \sqrt{1 - \frac{(105.3)^2}{(178.5)^2}} = .81. $$

This value indicates a definite and fairly close connection 
between these two variables for the states included in the 
sample. 
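
The coefficient may be computed from the two measures of dispersion in a few lines of Python. This is a sketch under the text's definition of r as the square root of one minus the squared ratio of Sy to σy; the function name `correlation_from_errors` is hypothetical.

```python
import math

def correlation_from_errors(s_y, sigma_y):
    """r = sqrt(1 - S_y^2 / sigma_y^2), the unsigned coefficient of correlation."""
    return math.sqrt(1 - (s_y / sigma_y) ** 2)

r = correlation_from_errors(105.3, 178.5)  # figures from the motor-car example
print(round(r, 2))
```

The two limiting cases discussed above follow directly: a standard error of zero gives 1, and a standard error equal to the standard deviation gives 0.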

The coefficient of correlation may be made somewhat
more significant by giving it the sign of the constant b in
the equation of relationship. This sign indicates whether
the slope of the line is positive or negative and, when
attached to r, enables us to tell whether the relationship
is direct or inverse. Thus in the present case high values
of one variable are paired with high values of the other.
The correlation is positive and the coefficient should be
written + .81. When cotton production and prices are
correlated the relationship is an inverse one, for high values
of one variable are generally associated with low values of
the other.

The measurement of relationship in a given case is completed
when we have secured the three measures described.
The equation of average relationship is an expression of 
the underlying law connecting the two variables, if such a 
law may be assumed. The standard error of estimate meas- 
ures the variation, in absolute terms, about the line of rela- 
tionship. The coefficient of correlation is an abstract measure 
of the degree to which the average relationship actually holds 
in practice. 


DETAILS OF CALCULATION

In the preceding section the attempt has been made to
explain the various measures necessary in studying the
relationship between variable quantities without introducing
a detailed explanation of procedure. We may now return
to a consideration of the details of calculation, including
certain methods by which this calculation may be reduced
to a minimum.

The procedure followed in the preceding illustration is a 
logical one to employ in deriving the three required values. 
This method is capable of general application, but the labor 
involved may be materially reduced by taking advantage
of a short-cut method of deriving Sy. This method may be 
first explained with reference to data of the type dealt 
with above. And, for the present, the discussion will be 
confined to cases in which the relationship between variables 
may be described by a straight line. 

The first problem is the derivation of the equation of 
relationship. A line of the type

Y = a + bX 

is fitted by the method of least squares. 

The next step is the computation of Sy², the square of the
standard error of estimate. This was done in the above
illustration by measuring the deviation of each individual
observation from the fitted line, and getting the mean-square of
these deviations. It may be shown^1 that this value can be
derived from the following equation:

$$ S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{N} $$

The quantities a and b are the constants in the equation to
the fitted straight line. The other values relate to the
original observations. Substituting in this equation a and b
and the other necessary values, taken from Table 86, we have^2

$$ S_y^2 = \frac{3{,}610{,}185 - (66.32136 \times 9{,}549) - (5.65925 \times 451{,}558)}{38} $$

= 11,090
Sy = 105.3.

1 The standard error of estimate is computed from the formula

$$ S_y^2 = \frac{\Sigma(d^2)}{N} $$

where d represents a single deviation from the fitted line, or the difference
between the actual and the computed value of Y in a given case. The latter
is derived from the equation

$$ Y_c = a + bX. $$

(The symbol Y_c is used to represent the computed value of Y.)
If we let Y represent the actual value, we have, for each residual,

$$ d = Y_c - Y $$

or

$$ d = a + bX - Y. \quad (1) $$

There will be as many equations of this type as there are points. Multiplying
each one by d, and adding, we have

$$ \Sigma(d^2) = a\Sigma(d) + b\Sigma(dX) - \Sigma(dY). \quad (2) $$

But, since the line was fitted by the method of least squares,

$$ \Sigma(d) = 0 $$
$$ \Sigma(dX) = 0 $$

(for proof of this see Appendix A)
and, therefore,

$$ \Sigma(d^2) = -\Sigma(dY). \quad (3) $$

Returning again to equation (1), we may multiply throughout by Y, and
add, securing

$$ \Sigma(dY) = a\Sigma(Y) + b\Sigma(XY) - \Sigma(Y^2). \quad (4) $$

Substituting the equivalent of Σ(dY) in equation (3), we have

$$ \Sigma(d^2) = \Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) \quad (5) $$

from which the given formula for Sy² is derived.

2 For this calculation the values of a and b are given to a greater number of
decimal places than in the equation as first presented.
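
The short-cut formula for the square of the standard error may be checked with the following Python sketch. The function name `se_squared_shortcut` is hypothetical; the sums and the constants a and b are those quoted in the text from Table 86.

```python
import math

def se_squared_shortcut(sum_y2, sum_y, sum_xy, a, b, n):
    """S_y^2 = (sum(Y^2) - a*sum(Y) - b*sum(XY)) / N for a least-squares line."""
    return (sum_y2 - a * sum_y - b * sum_xy) / n

s2 = se_squared_shortcut(3_610_185, 9_549, 451_558, a=66.32136, b=5.65925, n=38)
print(round(s2), round(math.sqrt(s2), 1))
```

The short-cut depends on the line having been fitted by least squares; with any other line the sums of d and of dX do not vanish and the formula does not hold.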

From this point the procedure may follow that already
described, r being computed from the formula

$$ r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}} $$

The coefficient r may be secured, however, without computing
S as an intermediate value. The above formula for
r may be reduced to

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} $$

where cy is the difference between the mean Y and the
origin employed in the calculations.^1 If the origin is zero
on the original Y scale, cy will be equal to the arithmetic
mean of the Y's.

1 The formula

$$ r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}} $$

may be written

$$ r^2 = 1 - \frac{S_y^2}{\sigma_y^2} $$

$$ r^2 = 1 - \frac{S_y^2}{\Sigma(y^2)/N} $$

in which y refers to deviations from the arithmetic mean of the Y's. But

$$ \frac{\Sigma(y^2)}{N} = \frac{\Sigma(Y^2)}{N} - c_y^2 $$

where Y represents a deviation from an arbitrary origin (in this case zero on the
original scale) and cy represents the difference between this origin and the
mean of the Y's.

Therefore

$$ r^2 = 1 - \frac{N S_y^2}{\Sigma(Y^2) - Nc_y^2} $$

Substituting in this equation the equivalent of Sy² as given in the footnote
on page 338,

$$ r^2 = 1 - \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{\Sigma(Y^2) - Nc_y^2} $$

Simplifying,

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} $$

In the present case, using the data of Table 86, we have

$$ c_y = \frac{9{,}549}{38} = 251.289. $$

The other values are the same as those employed above in
computing Sy. Substituting in the formula, we have

$$ r^2 = \frac{789{,}228.14}{1{,}210{,}630.86} $$

= .6519
r = .81.
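
The direct formula for r² may be verified in Python. This is a sketch, with the hypothetical function name `r_squared_shortcut`; it takes cy as the mean of the Y's, which is the case when the origin is zero on the original scale, and uses the sums and constants quoted in the text.

```python
import math

def r_squared_shortcut(sum_y2, sum_y, sum_xy, a, b, n):
    """r^2 = (a*sum(Y) + b*sum(XY) - N*c_y^2) / (sum(Y^2) - N*c_y^2)."""
    c_y = sum_y / n               # mean of Y, the origin being zero
    correction = n * c_y ** 2
    return (a * sum_y + b * sum_xy - correction) / (sum_y2 - correction)

r2 = r_squared_shortcut(3_610_185, 9_549, 451_558, a=66.32136, b=5.65925, n=38)
print(round(r2, 3), round(math.sqrt(r2), 2))
```

Because the sketch carries cy to full precision rather than to three decimals, the numerator and denominator may differ slightly in their last figures from those printed above, without affecting r to two places.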

In effect, then, the labor of fitting a straight line by the 
method of least squares gives us practically all the values 
needed in securing S and r, the two other measures necessary
for a complete description of the relationship between two
variable quantities. The only additional values required
are Σ(Y²) and cy.

The Construction of a Correlation Table

In the example presented above we had only thirty-eight 
observations. With a larger number it becomes practically 
impossible to retain the individual values in the study of 
relationships. These individual items must be grouped in 
significant classes, and all computations must be based 
upon these grouped data. This means, merely, that we must 
handle data organized in frequency distributions. Since 
we are dealing with two variables, however, the simple 
frequency table must be modified to meet the needs of 
the present problem. Such a modified frequency table, 
arranged to facilitate the computation of the values needed 
in studying relationship, is termed a correlation table.

As a typical problem, involving the construction of such 
a table, we may consider the relation between the discount 
rates of Federal Reserve banks and the corresponding discount
rates of commercial banks. Since the paper discounted
by commercial banks may be rediscounted at the Federal
Reserve banks by the member banks, some degree of 
relationship between the rates may be expected. Our present 
object is the measurement of that relationship. 

The first step is the tabulation of the original observa- 
tions. Monthly values of each variable^ were secured for 
each of the twelve Federal Reserve cities over a period of 
150 months, from July, 1920, to December, 1932. In the 
process of tabulation the items must be combined so that 
a Federal Reserve bank discount rate is paired with the 
corresponding rate charged by the commercial banks of the 
same city. Fig. 70 illustrates the method of tabulation. 

Tabulation having been completed, a correlation table 
designed to facilitate the later computations may be con- 
structed. Table 88 illustrates a suitable form. 

In Table 88, it will be noted, an arbitrary origin is employed
for each variable, and the class-interval unit is used
in the calculations. We here employ the symbols X' and
Y' to represent deviations from the arbitrary origin (which
is located at point 1.50, 3.50 on the original scales).

COMPUTATION OF MEASURES OF RELATIONSHIP

From this correlation table all the values needed in
fitting a straight line to the data, and in computing the
measures S and r, may be derived. The quantities Σ(X'),
Σ(X'²), Σ(Y'), and Σ(Y'²) are computed by methods already
familiar to the student. The product of the paired values
Σ(X'Y') may be computed directly from the correlation
table, but it is perhaps simpler for the beginner to re-arrange
the data in columnar form, as in Table 89 on page 345.
When the figures are disposed in this way one line is employed
for each compartment of the original correlation
table in which items have been recorded.

^ The discount rates of the Federal Reserve banks relate, for the first part
of the period covered, to trade acceptances; for later years they are "rates
for member banks on eligible paper." The commercial bank rates are those
charged on customers' prime commercial paper. The customary rate over
a given [illegible] day period was taken as of the middle of that period.

[Table 88: correlation table, with Federal Reserve bank discount rate as the column heading (class mid-points 3.25, 3.75, 4.25, 4.75, 5.25 legible); body of table not recoverable]

Table 89

Discount Rates of Federal Reserve Banks and Discount Rates of
Commercial Banks

(Computation of values required in curve fitting)

  X'      Y'      f       fX'Y'

Table 89, Continued

Discount Rates of Federal Reserve Banks and Discount Rates of
Commercial Banks

  X'      Y'      f       fX'Y'

   9       7      63      3,969
  10       7       9        630
  11       7      36      2,772
   9       8       7        504
  10       8       9        720
  11       8       1         88
   9       9       1         81
  10       9       1         90
  11       9       2        198

  Total                  42,932
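
The columnar computation of the products may be sketched in Python for the nine compartments legible in the continued part of Table 89. The variable names are hypothetical, and the partial sum obtained covers these nine lines only, not the full total of 42,932.

```python
# (X', Y', f) for the nine compartments shown in the continued part of Table 89.
rows = [
    (9, 7, 63), (10, 7, 9), (11, 7, 36),
    (9, 8, 7), (10, 8, 9), (11, 8, 1),
    (9, 9, 1), (10, 9, 1), (11, 9, 2),
]

# One product f * X' * Y' per compartment, as in the last column of the table.
products = [f * x * y for x, y, f in rows]
partial_sum = sum(products)
print(products, partial_sum)
```

Each product agrees with the corresponding entry in the fX'Y' column; adding the lines for the remaining compartments would complete the sum.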

The values required in fitting a straight line and in
computing the standard error and the coefficient of correlation
are:

N = 1,800            Σ(X'²) = 62,354

Σ(X') = 10,054       Σ(X'Y') = 42,932

Σ(Y') = 6,904        Σ(Y'²) = 30,878.

The equation to the best fitting straight line is found to
be

Y' = - .10277 + .70509X'.

Substituting in the formula

$$ S_y^2 = \frac{\Sigma(Y'^2) - a\Sigma(Y') - b\Sigma(X'Y')}{N} $$

we have

$$ S_y^2 = \frac{30{,}878 - (-.10277 \times 6{,}904) - (.70509 \times 42{,}932)}{1{,}800} $$

= .7314
Sy = .855.

To determine the value of the coefficient of correlation
we have only to substitute the proper values in the equation

$$ r^2 = \frac{a\Sigma(Y') + b\Sigma(X'Y') - Nc_y^2}{\Sigma(Y'^2) - Nc_y^2} $$

When this is done we have

$$ r^2 = \frac{(-.10277 \times 6{,}904) + (.70509 \times 42{,}932) - (1{,}800 \times 14.71149)}{30{,}878 - (1{,}800 \times 14.71149)} $$

$$ = \frac{3{,}080.7178}{4{,}397.3180} $$

= .70059
r = + .837.
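
The two substitutions for the grouped data may be reproduced in Python. This is a sketch using the sums quoted in the text, in class-interval units, with the slope taken as .70509 as it appears in the substituted figures; the variable names are hypothetical.

```python
import math

n = 1_800
sum_y, sum_y2, sum_xy = 6_904, 30_878, 42_932  # sums of Y', Y'^2 and X'Y'
a, b = -0.10277, 0.70509                       # constants of the fitted line

# Square of the standard error of estimate, in class-interval units.
s2 = (sum_y2 - a * sum_y - b * sum_xy) / n

# Square of the coefficient of correlation; c_y is the mean of the Y' values.
correction = n * (sum_y / n) ** 2
r2 = (a * sum_y + b * sum_xy - correction) / (sum_y2 - correction)

print(round(s2, 4), round(math.sqrt(s2), 3), round(math.sqrt(r2), 3))
```

The same two formulas serve for grouped and ungrouped data alike; only the units in which the sums are expressed differ.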

All these calculations have been carried through in class-interval
units, with reference to an origin at point 1.50, 3.50
on the original scales. The value of r is not affected by this
fact, but the estimating equation and the standard error
of estimate should be corrected.

The value of Sy, in class-interval units, is .855. Since
the class-interval of the Y-variable is .50, we have, in
original units,

Sy = .50 × .855
= .4275.

The equation may be corrected in a similar fashion.
The class-interval being .50 both for X and Y, each unit
on the original scale equals two class-interval units. Thus
a range of 4 points on the original scale is equivalent to
a range of 8 points on the class-interval scale. For convenience
we may use Y'' and X'' to define deviations in original
units (i.e., deviations from the arbitrary origin), where we
have used Y' and X' to define corresponding deviations in
class-interval units. Then, for any stated deviation,
2Y'' = Y'; similarly 2X'' = X'. Retaining the values of
a and b in the equation of average relationship, and substituting
2Y'' for Y' and 2X'' for X', we have

2Y'' = - .10277 + .70509(2X'').

Simplifying this, we have

Y'' = - .05138 + .70509X''

which is the equation in terms of original units.

This equation refers to an origin whose coordinates
are 1.50 and 3.50 on the original scales. That is,
Y'' = Y - 3.50, and X'' = X - 1.50, where Y and X
define deviations, in original units, from the zero points
on the original scales. Making these substitutions we have

Y - 3.50 = - .05138 + .70509(X - 1.50).

Simplifying, and rounding off the constants by dropping
figures that are not significant, we have

Y = 2.391 + .705X.
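
The correction of the equation from class-interval units to original units may be written as a small Python function. This is a sketch only: the name `to_original_units` and its parameters are hypothetical, and it generalizes the two substitutions made above to any class-intervals and any arbitrary origin.

```python
def to_original_units(a_ci, b_ci, interval_x, interval_y, origin_x, origin_y):
    """Convert Y' = a + bX' (class-interval units, arbitrary origin) to Y = A + BX."""
    slope = b_ci * interval_y / interval_x
    intercept = origin_y + a_ci * interval_y - slope * origin_x
    return intercept, slope

# Class-interval .50 for both variables; arbitrary origin at X = 1.50, Y = 3.50.
A, B = to_original_units(-0.10277, 0.70509, 0.50, 0.50, 1.50, 3.50)
print(round(A, 3), round(B, 3))
```

When the two class-intervals are equal, as here, the slope is unchanged by the conversion and only the constant term is affected.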

We have now the three values required for determining
the relationship between Federal Reserve discount rates
and corresponding commercial bank rates, during the period
covered. The equation describes the average relationship,
the standard error of estimate serves as a measure of the
reliability of estimates based upon this equation, and the
coefficient of correlation serves as an abstract measure of
the degree of relationship between the two variables.

Fig. 71. Scatter Diagram of Federal Reserve and Commercial Bank
Rates, with Line of Average Relationship and Zones of Estimate

The significance of the standard error, Sy, is brought out
graphically in Fig. 71. The line of average relationship
has been drawn on this scatter diagram, and what may be
called "zones of estimate" have been marked out about
this line. Within the zone having a width equal to 2S,
centering at the fitted straight line, 68 per cent of all the
points should fall, on the assumption that the distribution
is normal. Within the zone having a width equal to 6S,
centering at the fitted straight line, 99.7 per cent of all
the points should fall, on the same assumption. The smaller
the value of S the narrower these zones would be, and hence 
the more accurate would be the estimates which are based 
upon the equation of average relationship. 

The Product-Moment Formula for the Coefficient
of Correlation

In the preceding examples the coefficient of correlation
has been computed from the formula

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} $$

which is based upon relations involved in fitting a straight 
line by least squares. The usual formula differs somewhat 
from this, and it is advisable that the student be familiar 
with it. 

When a straight line is fitted to data, the origin being
at the point of averages, the two normal equations

$$ \Sigma(Y) = Na + b\Sigma(X) $$

$$ \Sigma(XY) = a\Sigma(X) + b\Sigma(X^2) $$

become

$$ \Sigma(y) = Na + b\Sigma(x) $$

$$ \Sigma(xy) = a\Sigma(x) + b\Sigma(x^2) $$

where y and x measure deviations from the point of averages.
The first of these equations disappears and the second
reduces to

$$ \Sigma(xy) = b\Sigma(x^2) $$

for

$$ \Sigma(x) = 0 \text{ and } \Sigma(y) = 0. $$

The slope, b, is the only constant required, and this may be
computed from the relationship

$$ b = \frac{\Sigma(xy)}{\Sigma(x^2)} $$

Under the same conditions the formula

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} $$

reduces to

$$ r^2 = \frac{b\Sigma(xy)}{\Sigma(y^2)} $$

for cy = 0 when the deviations are measured from the mean
of the Y's. Substituting for b its equivalent, as just determined,
we have

$$ r^2 = \frac{[\Sigma(xy)]^2}{\Sigma(x^2)\,\Sigma(y^2)} $$

But Σ(y²) = Nσy² and Σ(x²) = Nσx².
Therefore

$$ r^2 = \frac{[\Sigma(xy)]^2}{N^2\sigma_x^2\sigma_y^2} $$

and

$$ r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} $$

in which x and y refer to deviations from an origin at the
point of averages.
This formula may be expressed

$$ r = \frac{p}{\sigma_x\sigma_y} $$

in which

$$ p = \frac{\Sigma(xy)}{N} $$

The quantity p is the mean product of the paired values of
x and y.

The computation of the coefficient of correlation from
this formula proceeds along lines somewhat different from
those outlined above. As we have seen, both the arithmetic
mean and the standard deviation may be readily computed
by the selection of an arbitrary origin from which all
deviations are measured, a later correction being made to
offset the error involved in using this arbitrary origin.
Similarly, the mean product p may be computed by a short
method, requiring the use of assumed means and the application
of a correction at the end of the process.

If x' and y' represent deviations from points arbitrarily
selected as assumed means, while p' represents the mean
product of such deviations, then

$$ p' = \frac{\Sigma(x'y')}{N} $$

The computation of p' is not difficult, for deviations may
be measured from central points, and may be expressed in
class-interval units. Having p' we may secure the true
mean product from the formula

$$ p = p' - c_x c_y $$

in which cx and cy represent the differences between the
true and assumed means of the x's and y's, respectively.^1

1 The following is a proof of this relationship:

x' = deviation of any point from assumed mean of x's
x = deviation of same point from true mean of x's
cx = difference between true and assumed means of x's
y' = deviation of same point from assumed mean of y's
y = deviation of same point from true mean of y's
cy = difference between true and assumed means of y's

$$ x' = x + c_x $$
$$ y' = y + c_y $$
$$ x'y' = (x + c_x)(y + c_y) = xy + c_x y + c_y x + c_x c_y. $$

For the sum of all such products for N points, we have

$$ \Sigma(x'y') = \Sigma(xy) + c_x\Sigma(y) + c_y\Sigma(x) + Nc_x c_y. $$

But Σ(y) = 0 and Σ(x) = 0.

Therefore

$$ \Sigma(x'y') = \Sigma(xy) + Nc_x c_y $$

$$ \frac{\Sigma(x'y')}{N} = \frac{\Sigma(xy)}{N} + c_x c_y $$

$$ \frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_x c_y $$

or p = p' - cx cy.

The Product-Moment Method, Ungrouped Data

This method may be illustrated with reference, first, to
ungrouped data, using the figures for personal incomes (X)
and passenger car registration (Y), by states. The values
required for this computation, as given in Table 86, are

N = 38
Σ(X) = 1,242
Σ(Y) = 9,549
Σ(X²) = 65,236
Σ(Y²) = 3,610,185
Σ(XY) = 451,558.

The mean product may be computed from the formula

$$ p = \frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_x c_y $$

We may select as arbitrary origin the actual origin on the
two original scales. Hence we have

$$ p = \frac{\Sigma(XY)}{N} - c_x c_y $$

(When the arbitrary origin is at zero on the original scales, the
symbol X corresponds to x' and Y corresponds to y', as used in
the formulas.)

For the two standard deviations

$$ \sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2} $$

$$ \sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2} $$

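These measures may be computed readily from the values
secured from Table 86:

$$ c_x = \frac{1{,}242}{38} = 32.684 $$

$$ c_x^2 = 1{,}068.2439 $$

$$ c_y = \frac{9{,}549}{38} = 251.289 $$

$$ c_y^2 = 63{,}146.1615 $$

$$ p = \frac{\Sigma(XY)}{N} - 8{,}213.1297 = 3{,}670.0753 $$

$$ \sigma_x = \sqrt{\frac{65{,}236}{38} - 1{,}068.2439} = 25.47 $$

$$ \sigma_y = \sqrt{\frac{3{,}610{,}185}{38} - 63{,}146.1615} = 178.49 $$

Solving for the coefficient of correlation,

$$ r = \frac{p}{\sigma_x\sigma_y} = \frac{3{,}670.0753}{25.47 \times 178.49} $$

= + .8073.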
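
The product-moment computation for ungrouped data may be sketched in Python from the six sums quoted from Table 86. The function name `product_moment_r` is hypothetical. The sketch carries the means to full precision, so its intermediate figures may differ in the later decimals from those printed above, though r should agree to about three places.

```python
import math

def product_moment_r(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Coefficient of correlation from raw sums, the origin being zero on both scales."""
    c_x, c_y = sum_x / n, sum_y / n               # the two means
    p = sum_xy / n - c_x * c_y                    # mean product about the means
    sigma_x = math.sqrt(sum_x2 / n - c_x ** 2)
    sigma_y = math.sqrt(sum_y2 / n - c_y ** 2)
    return p / (sigma_x * sigma_y), sigma_x, sigma_y

r, sigma_x, sigma_y = product_moment_r(38, 1_242, 9_549, 65_236, 3_610_185, 451_558)
print(round(r, 3), round(sigma_x, 2), round(sigma_y, 2))
```

No line need be fitted beforehand; the slope of the line of average relationship may afterwards be had as r multiplied by the ratio of σy to σx.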


The equation to the straight line which describes the
average relationship between X and Y may be derived
from the values required for the preceding calculations.
When the origin is at the point of averages this equation
may be written

$$ y = r\frac{\sigma_y}{\sigma_x}x $$

Substituting the proper values, we have

$$ y = + .8073 \cdot \frac{178.49}{25.47}\,x $$

= 5.657x.

This, with an insignificant difference, is the equation secured
by the method of least squares. The constant term representing
the y-intercept disappears, since the origin is at
the point of averages, through which the least squares line
must pass.^1

When the product-moment method is employed in computing
the coefficient of correlation and in determining the
equation of regression, the standard error, Sy, may be
derived by a simple change in the formula first presented
for r. From the expression

$$ r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}} $$

we may secure the formula

$$ S_y = \sigma_y\sqrt{1 - r^2} $$

which enables us to compute Sy, if we have the values of
σy and r. In the present case,

$$ S_y = 178.49\sqrt{1 - (.8073)^2} $$

= 105.3.
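
This last step may be checked in Python. The function name `standard_error_from_r` is hypothetical; the values of σy and r are those found above.

```python
import math

def standard_error_from_r(sigma_y, r):
    """S_y = sigma_y * sqrt(1 - r^2)."""
    return sigma_y * math.sqrt(1 - r ** 2)

s_y = standard_error_from_r(178.49, 0.8073)
print(round(s_y, 1))
```

It is r², not r, that is subtracted from unity under the radical.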


The Product-Moment Method, Classified Data

The product-moment method is also applicable to cases
in which it is necessary to construct a double frequency or
correlation table. The procedure is shown in detail in
Table 90.

1 That the formula y = r(σy/σx)x is equivalent to the formula based upon the
method of least squares may be readily demonstrated. When the line passes
through the point of averages, the equation, Y = a + bX, becomes y = bx.
We may write, accordingly,

$$ y_c = bx. \quad (1) $$

But

$$ b = \frac{\Sigma(xy)}{\Sigma(x^2)} = \frac{\Sigma(xy)}{N\sigma_x^2} $$

so that

$$ y_c = \frac{\Sigma(xy)}{N\sigma_x^2}\,x. \quad (2) $$

This is equivalent to

$$ y_c = r\frac{\sigma_y}{\sigma_x}x $$

for the latter may be written

$$ y_c = \frac{\Sigma(xy)}{N\sigma_y\sigma_x}\cdot\frac{\sigma_y}{\sigma_x}\,x \quad (3) $$

$$ y_c = \frac{\Sigma(xy)}{N\sigma_x^2}\,x. \quad (4) $$

(The symbol y_c is employed for the computed value of y, in these equations,
to avoid confusion with the actual y's which appear in the right-hand members
of the equations.)

This table is identical with that previously presented for
the same data, except that a different arbitrary origin has
been selected.

The value 4.50 is adopted as the assumed mean of the
X's ($M'_x$), and the value 5.50 as the assumed mean of
the Y's ($M'_y$). Deviations are measured in class-interval
units from this origin. In each compartment of the correlation
table there are three figures, involved in the computation
of $\Sigma(x'y')$. The figure in the center indicates the number
of items falling in that compartment. Thus there are
seven pairs having X values between 5.75 and 6.25 (mid-point
6.0) and Y values between 7.25 and 7.75 (mid-point
7.5). For each of these pairs x' (the deviation from the
assumed mean of the X's) is + 3, in class-interval units,
and y' (the deviation from the assumed mean of the Y's)
is + 4, in class-interval units. For each pair, therefore,
x'y' = + 12. This figure appears at the top of the compartment.
But there are seven pairs in this compartment, so
the sum of x'y' for this group is + 84. This figure appears
in parentheses at the bottom of the compartment. To
secure $\Sigma(x'y')$ for the entire table it is necessary to add
algebraically the values secured in this way for all compartments.
The addition is first carried out for the different
rows, the sub-totals being given in the column at the right
of the table. It is found that $\Sigma(x'y')$ = + 4,492, in
class-interval units.

Details of the computation of the coefficient of correlation
are given in Table 91 on page 358. The values of $\sigma_x$ and $\sigma_y$
are obtained by methods already familiar.

We have, from that table,

$$p = \frac{\Sigma(x'y')}{N} - c_x c_y = +2.4277.$$

This is the value of p, the mean product, in class-interval
units. Proceeding,

$$r = \frac{p}{\sigma_x \sigma_y} = +.837.$$

In computing r, both the numerator and denominator of 
the final fraction (the mean product and the two standard 
deviations) are in class-interval units. Since this is true, 
r may be computed directly without reducing the figures 
to the original units. The entire operation, therefore, is 
carried on in simple class-interval units.


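Table 91


Calculation of the Coefficient of Correlation between the Discount Rates
of Commercial Banks and of Federal Reserve Banks
(Calculations based on the entries in Table 90)


Federal Reserve bank rates (X):

$M'_x = 4.50$

$c_x = \frac{-746}{1{,}800} = -.414$

$c_x^2 = (-.414)^2 = .171$

$\frac{\Sigma(fx'^2)}{N} = \frac{6{,}506}{1{,}800} = 3.614$

$\sigma_x^2 = 3.614 - .171 = 3.443$

$\sigma_x = 1.855$

$M_x = 4.50 - .5(.414) = 4.293$


Commercial bank rates (Y):

$M'_y = 5.50$

$c_y = \frac{\Sigma(fy')}{1{,}800} = -.164$

$c_y^2 = (-.164)^2 = .027$

$\frac{\Sigma(fy'^2)}{N} = \frac{4{,}446}{1{,}800} = 2.470$

$\sigma_y^2 = 2.470 - .027 = 2.443$

$\sigma_y = 1.563$

$M_y = 5.50 - .5(.164) = 5.418$


Mean product and coefficient of correlation:

$p = \frac{\Sigma(x'y')}{N} - c_x c_y = \frac{4{,}492}{1{,}800} - (-.414)(-.164) = 2.4956 - .0679 = +2.4277$

$r = \frac{p}{\sigma_x \sigma_y} = \frac{+2.4277}{(1.855)(1.563)} = \frac{+2.4277}{2.8994} = +.837$


Note: The class-interval unit has been employed in all the computations
shown in this table.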
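The computations of Table 91 may be retraced in a few lines. The sketch below assumes the totals printed in that table (N = 1,800, the sums for the X-variable, the printed value of c_y, and the sum of the products); the variable names are illustrative, and the class-interval of .50 is that used in the text.

```python
import math

n = 1800
class_interval = 0.50

# X-variable (Federal Reserve bank rates), in class-interval units.
assumed_mean_x = 4.50
c_x = -746 / n                     # mean deviation from the assumed mean
var_x = 6506 / n - c_x ** 2        # variance about the true mean
sigma_x = math.sqrt(var_x)

# Y-variable (commercial bank rates), in class-interval units.
assumed_mean_y = 5.50
c_y = -0.164                       # value as printed in Table 91
var_y = 4446 / n - c_y ** 2
sigma_y = math.sqrt(var_y)

# Mean product and coefficient of correlation.
p = 4492 / n - c_x * c_y
r = p / (sigma_x * sigma_y)

# True means, restored to the original units.
mean_x = assumed_mean_x + class_interval * c_x
mean_y = assumed_mean_y + class_interval * c_y

print(f"sigma_x = {sigma_x:.3f}, sigma_y = {sigma_y:.3f}")
print(f"p = {p:+.4f}, r = {r:+.3f}")
print(f"M_x = {mean_x:.3f}, M_y = {mean_y:.3f}")
```

Since p and both standard deviations are expressed in the same class-interval units, the units cancel in the quotient, and r requires no conversion.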


In deriving the equation to the straight line which
describes the average relationship between x and y from
the formula

$$y = r \frac{\sigma_y}{\sigma_x} x$$

$\sigma_y$ and $\sigma_x$ should be expressed in units of the original scales.¹
This is done by multiplying the present values by the
class-intervals.

$\sigma_x$ (in original units) = 1.855 × .50 = .9275
$\sigma_y$ (in original units) = 1.563 × .50 = .7815.

Substituting the given values in the formula, we have

$$y = .837 \times \frac{.7815}{.9275} x = .705x.$$

¹ When the class-intervals happen to be the same, as in the present case,
the change is not necessary, as the relation between numerator and denominator
is not altered. In practice it is advisable always to express the two
standard deviations in original units at this stage of the calculations.


The Lines of Regression

In the above discussion certain terms ordinarily employed
in the treatment of correlation have been purposely omitted.
Several of these should be explained.

The equation to the line of best fit in the preceding
illustration was found to be

$$y = .705x$$

when the origin was taken at the point of averages. In
this equation y is expressed as a function of x; that is, x is
taken to be the independent variable and y the dependent
variable. The equation expresses the average variation in
y (discount rates of commercial banks) corresponding to a
change of one unit in x (discount rates of Federal Reserve
Banks). This line of relationship corresponds precisely to
a line of trend, which describes the average change in a
given series accompanying a unit change in time. A line
which thus describes the average relationship between two
variables is termed a line of regression. Its equation is
termed a regression equation, and the quantity $r \frac{\sigma_y}{\sigma_x}$ which
gives the slope of such a line is called a coefficient of regression.
The use of these terms dates back to early studies by
Galton, dealing with the relation between the heights of
fathers and the heights of sons. Sons, Galton found, deviated
less on the average from the mean heights of the race than
their fathers. Whether the fathers were above or below
the average, the sons tended to go back or regress towards
the mean. He therefore termed the line which graphically
described the average relationship between these two variables
the line of regression. The term is now used generally,
as indicated above, though the original meaning has no
significance in most of its applications.

In any given case equations to two lines of regression
may be computed. One is an expression of the average
relationship between a dependent Y-variable and an independent
X-variable; the other describes the relationship
between a dependent X-variable and an independent
Y-variable. The significance of the two may be indicated
graphically.

Figure 72 is derived directly from the scatter diagram
presented in Fig. 71. The circle in each column represents
the mean Y-value of all the items falling in that column.
Thus in the third column there are 40 cases including all
those with X-values falling between 2.25 per cent and
2.75 per cent. The Y-values vary, however, being distributed
as shown in Table 92.


Table 92


Computation of the Arithmetic Mean of an Array


Class-interval    Mid-point    Frequency      fm
                      m            f

4.75 - 5.24          5.0           4         20.0
4.25 - 4.74          4.5          16         72.0
3.75 - 4.24          4.0          19         76.0
3.25 - 3.74          3.5           1          3.5

                                  40        171.5


$$M = \frac{171.5}{40} = 4.2875.$$

Similar mean values are obtained for the other columns. 
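The mean of a single array is an ordinary weighted mean of the mid-points, and may be sketched as follows. The figures are those of Table 92; the names `midpoints` and `frequencies` are illustrative.

```python
# Mid-points and frequencies of the third column (Table 92).
midpoints = [5.0, 4.5, 4.0, 3.5]
frequencies = [4, 16, 19, 1]

# Weighted mean: sum of f*m divided by the sum of f.
total_fm = sum(f * m for f, m in zip(frequencies, midpoints))
total_f = sum(frequencies)
array_mean = total_fm / total_f

print(total_f, total_fm, array_mean)
```

The same operation, repeated for each column, yields the series of column means through which the broken line of the figure is drawn.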




These are plotted in Fig. 72, together with the line of 
regression of Y on X. 

In Fig. 72 the X-variable (Federal Reserve bank discount
rates) is independent. As it increases from 4.0 per cent
to 4.5, 5.0, 5.5 per cent, and so on, the average of commercial
bank rates increases also. An average commercial
bank rate of 4.29 per cent was associated with an average
Federal Reserve bank rate of 2.5 per cent; an average
commercial bank rate of 4.56 per cent was associated with
an average Federal Reserve bank rate of 3.0 per cent,
and so on. (The commercial bank rates cited are the means
of the entries in the columns referred to.) The slope of
the straight line, which is the line of regression or the line
of average relationship, measures the average increase in
commercial bank rates corresponding to a unit increase in
Federal Reserve bank rates.

[Figure 72. Horizontal axis: Federal Reserve bank rates, per cent, with
class limits from 1.25 to 7.25. Means of columns: 3.60, 3.90, 4.28, 4.56,
4.86, 5.11, 5.60, 5.96, 6.40, 6.69, 7.25, 7.02.]

Fig. 72. Showing the Relation between Discount Rates of Commercial
Banks and Federal Reserve Bank Discount Rates. (The broken line
connects the means of the columns and the straight line shows the average
change in commercial bank rates corresponding to a unit change in Federal
Reserve bank rates; i.e., it represents the regression of Y on X)

It is possible to view the relationship between these two
variables in another light. These questions arise: Given
a certain commercial bank discount rate, what is the average
Federal Reserve bank rate associated with it? And for a
given change in commercial bank discount rates, what is
the average change in the corresponding Federal Reserve
bank rates? The commercial bank rate is now looked upon
as independent, and the Federal Reserve rate as an associated
dependent variable. These questions are answered by
Fig. 73. The points marked by the small circles and connected
by the broken line show the locations of the arithmetic
means of the items falling in the various rows. Thus the
16 X-items in the bottom row have an average value of
2.75 per cent. This is the average Federal Reserve bank
discount rate associated with a commercial bank rate of
3.5 per cent. The average Federal Reserve bank rate
associated with a commercial bank rate of 4.0 per cent is
2.93 per cent, and so on. The straight line fitted to these
points indicates the relationship between the two, its slope
measuring the average increase (or decrease) in Federal
Reserve bank rates associated with a unit change in commercial
bank rates.

[Figure 73. Horizontal axis: Federal Reserve bank rates, per cent.]

Fig. 73. Showing the Relation between Federal Reserve Bank Discount
Rates and the Discount Rates of Commercial Banks. (The broken line
connects the means of the rows and the straight line shows the average
change in Federal Reserve bank rates corresponding to a unit change in
commercial bank rates; i.e., it represents the regression of X on Y)

This is the line of regression of X on Y. The general
formula for the equation to this line is:

$$x = r \frac{\sigma_x}{\sigma_y} y.$$

Substituting the present values, we have

$$x = .837 \times \frac{.9275}{.7815} y$$

or

$$x = .993y.$$

The factors in this equation, it will be seen, are the same as
those entering into the formula for the line of regression of
y on x.¹ If r is equal to 1 the two lines coincide, and if,
in addition, the two standard deviations are equal, the line
of regression will bisect the angle formed by the axes.
If the points be plotted on a chart scaled in units of the
standard deviations, we have y = rx; the slope of the line
of regression is then equal to the value of r.

The coefficient of regression is represented by the symbol b.
In a simple correlation problem there are two such
coefficients, representing the slopes of the two lines of
regression. These are

$$b_{yx} = r \frac{\sigma_y}{\sigma_x}, \qquad b_{xy} = r \frac{\sigma_x}{\sigma_y}.$$

(The subscripts indicate the relation between the two variables.
The first subscript refers to the dependent variable in each
case.)

The coefficient r appears in both formulas. This being
so, it is clear that r may be computed from the regression
coefficients. For

$$\sqrt{b_{yx} \cdot b_{xy}} = \sqrt{r \frac{\sigma_y}{\sigma_x} \cdot r \frac{\sigma_x}{\sigma_y}} = \sqrt{r^2} = r.$$

Thus if we know the slopes of the two lines of regression
r may be determined. In the present example

$$r = \sqrt{.705 \times .993} = .837.$$

¹ The formula $x = r \frac{\sigma_x}{\sigma_y} y$ may be reduced to $x = \frac{\Sigma(xy)}{\Sigma(y^2)} y$.
This is the equation to a line fitted to the points plotted in Fig. 73 in such a
way that the sum of the squares of the horizontal deviations is a minimum.
The formula

$$y = \frac{\Sigma(xy)}{\Sigma(x^2)} x$$

is the equation to the line for which the sum of the squares of the vertical
deviations is a minimum. An understanding of this point may make clear
the difference between the two lines of regression.
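The relation between r and the two coefficients of regression may be illustrated by a brief sketch. It assumes the values of r and of the two standard deviations (in original units) given in the text; the names `b_yx` and `b_xy` follow the subscript convention described above, and `r_recovered` is illustrative.

```python
import math

r = 0.837
sigma_x, sigma_y = 0.9275, 0.7815   # standard deviations in original units

# The two coefficients of regression.
b_yx = r * sigma_y / sigma_x        # slope of the regression of y on x
b_xy = r * sigma_x / sigma_y        # slope of the regression of x on y

# r is the geometric mean of the two slopes. The square root yields only
# the magnitude; the sign of r is that shared by the two slopes.
r_recovered = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(f"b_yx = {b_yx:.3f}, b_xy = {b_xy:.3f}, r = {r_recovered:.3f}")
```

The ratios of the standard deviations cancel in the product, which is why the two slopes, however different, always carry the same information concerning r.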


USE OF THE EQUATIONS OF REGRESSION


The two equations of regression given above

$$y = .705x$$

and

$$x = .993y$$

describe relations between deviations from the respective
arithmetic means. That is, the origin is at the point of
averages, and to use the equations we cannot use the
original values of X and Y but must express them as deviations
from their means. For example, we wish to determine
the normal commercial bank rate associated with a Federal
Reserve bank rate of 6 per cent. The mean value of the
X-variable (Federal Reserve bank rates) is 4.293 per cent. A
rate of 6 per cent represents a deviation from the mean
of + 1.707. Substituting this value in the first of the
above equations, we have

$$y = .705 \times (+1.707) = +1.203.$$

This is the average y-deviation associated with an x-deviation
of + 1.707. To get the normal commercial bank rate
associated with a Federal Reserve rate of 6 per cent the
quantity + 1.203 per cent must be added to the mean
commercial bank rate, 5.418 per cent. The value we wish
is thus 6.621 per cent.

This calculation has been rather round-about because
of the form of the equation of relationship. This equation
can be put in more appropriate form for such computations.
Let

$\bar{X}$ = arithmetic mean of the X's
$\bar{Y}$ = arithmetic mean of the Y's.

Then

$$y = r \frac{\sigma_y}{\sigma_x} x$$

may be written

$$Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X}).$$

In this last equation X and Y represent the values of the
variables on the original scales, and not as deviations from
their respective means. In terms of the coordinate chart, it
means shifting the origin from the point of averages to a
point corresponding to zero on each of the original scales.

To illustrate the greater utility of the equation in this
form, the equation

$$y = .705x$$

may be changed in the manner indicated. It becomes

$$Y - 5.418 = .705(X - 4.293) = .705X - 3.027$$

$$Y = 2.391 + .705X.$$

This is the equation with the origin so shifted that the
original values may be employed directly. To determine
the commercial bank rate normally associated with a Federal
Reserve rate of 6 per cent we may substitute the latter
value in the equation just secured.

$$Y = 2.391 + (.705 \times 6.0) = 6.621.$$

Precisely the same results are secured as with the equation
in the other form, but for many purposes it is preferable
to have an equation in which the actual values may be
inserted.

The equation

$$x = r \frac{\sigma_x}{\sigma_y} y$$

may be similarly changed to

$$X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y}).$$
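The shift of origin may be expressed as a small routine which returns the constants of the equation Y = a + bX. This is a sketch under the assumption that r, the two standard deviations, and the two means are already known; the function name `regression_y_on_x` is hypothetical and is introduced here only for illustration.

```python
def regression_y_on_x(r, sigma_x, sigma_y, mean_x, mean_y):
    """Return (a, b) for Y = a + bX, with the origin at zero on both scales."""
    b = r * sigma_y / sigma_x       # coefficient of regression of Y on X
    a = mean_y - b * mean_x         # the line passes through the point of averages
    return a, b


# Values from the text: discount rates of Federal Reserve banks (X)
# and of commercial banks (Y), in original units.
a, b = regression_y_on_x(r=0.837, sigma_x=0.9275, sigma_y=0.7815,
                         mean_x=4.293, mean_y=5.418)

# Commercial bank rate normally associated with a Federal Reserve rate of 6 per cent.
estimate = a + b * 6.0

print(f"Y = {a:.3f} + {b:.3f} X")
print(f"estimate at X = 6.0: {estimate:.3f}")
```

The result differs from the printed figure only in the later decimal places, because the text rounds the slope to .705 before multiplying.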


Summary of Correlation Procedure

In the foregoing pages there have been presented two
quite different methods of securing the values required in
measuring the relationship between two variables. The
steps in the two methods may be briefly summarized. The
method of least squares is basic in both cases, but that term
may appropriately be employed to describe the first method
outlined, for the process of fitting the line is the first and
fundamental step in that procedure.

I. The Least Squares Method. 

A. Data to be handled as individual items. 

1. Fit a straight line to the data by the method of least
squares. A simple arrangement of the data in columns
will permit the ready computation of the required
values, $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$, $\Sigma(Y^2)$, $\Sigma(XY)$. The equation
thus obtained describes the average relationship
between the two variables.

2. Compute the standard error of estimate, $S_y$, from the
formula

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{N}.$$

$S_y$ is a measure of the reliability of estimates based
upon the equation of relationship, and is to be interpreted
in the same way as is the standard deviation
about an arithmetic mean.

3. Compute the coefficient of correlation, r, from the formula

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

or from

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - N c_y^2}{\Sigma(Y^2) - N c_y^2}.$$

Give r the sign of the constant b in the equation of regression.
This coefficient is an abstract measure of the degree
of relationship between the two variables, in so far as this
relationship may be described by a straight line.

4. If an equation describing the regression of X on Y
(X being dependent) is desired, the proper values may
be substituted in the two normal equations

$$\Sigma(X) = Na + b\Sigma(Y)$$

$$\Sigma(XY) = a\Sigma(Y) + b\Sigma(Y^2).$$

The equation secured will be of the type

$$X = a + bY.$$

The standard error of estimate, $S_x$, may be computed by
making the appropriate changes in the formula as given
for $S_y$. The value of r will be the same as in the preceding
case, in which Y is dependent.

B. Data to be classified. 

1. Select an appropriate class-interval and tabulate the 
items in the form of a correlation table. 

2. Compute the necessary values for fitting a straight line 
to the data. In doing so, an arbitrary origin may be 
selected for each variable, and all values expressed in 
class-interval units. A re-arrangement in columnar form 
may facilitate the computation of the quantity
$\Sigma(X'Y')$.

3. Compute the standard error of estimate, employing the
formula given above.

4. Compute the coefficient of correlation from the formula 
given above. 

5. If the above calculations were carried on in class-interval 
units, the equation of average relationship and the stand- 
ard error of estimate should now be expressed in terms of 
the original units of measurement. If an arbitrary origin 
was employed, the equation should be corrected so that 
the variables relate to deviations from the true origin. 
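The steps of the least squares method for individual items (I. A., steps 1 to 3) may be sketched as follows. The five paired observations are hypothetical and serve only to exercise the formulas; the function name `least_squares_summary` is likewise an assumption and not part of the original.

```python
import math


def least_squares_summary(xs, ys):
    """Return (a, b, S_y, r) for the line Y = a + bX fitted by least squares."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Step 1: solve the two normal equations
    #   sum(Y)  = N*a + b*sum(X)
    #   sum(XY) = a*sum(X) + b*sum(X^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n

    # Step 2: standard error of estimate.
    s_y_squared = (sum_y2 - a * sum_y - b * sum_xy) / n
    s_y = math.sqrt(max(s_y_squared, 0.0))

    # Step 3: coefficient of correlation, given the sign of b.
    c_y = sum_y / n
    r_squared = (a * sum_y + b * sum_xy - n * c_y ** 2) / (sum_y2 - n * c_y ** 2)
    r = math.copysign(math.sqrt(max(r_squared, 0.0)), b)
    return a, b, s_y, r


# Hypothetical observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, s_y, r = least_squares_summary(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X,  S_y = {s_y:.4f},  r = {r:+.4f}")
```

The guard `max(..., 0.0)` merely protects the square root against small negative values arising from rounding when the fit is nearly perfect.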


II. The Product-Moment Method. 


A. Data to be handled as individual items. 

1. Arrange the paired observations in parallel columns and
compute the quantities $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$, $\Sigma(Y^2)$,
$\Sigma(XY)$.

2. Divide these quantities throughout by N. For the first
two of these quotients we may use the symbols $c_x$ and
$c_y$:

$$c_x = \frac{\Sigma(X)}{N} \qquad \text{and} \qquad c_y = \frac{\Sigma(Y)}{N}.$$

3. Compute the mean product from the formula

$$p = \frac{\Sigma(XY)}{N} - c_x c_y.$$

4. Compute the two standard deviations from the formulas

$$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2}, \qquad \sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2}.$$

5. Compute the coefficient of correlation from the formula

$$r = \frac{p}{\sigma_x \sigma_y}.$$

6. Determine the equations of regression by substituting
the proper values in the formulas

$$y = r \frac{\sigma_y}{\sigma_x} x$$

$$x = r \frac{\sigma_x}{\sigma_y} y$$

(Note: For each of these equations the origin is at the
point of averages.)

7. If desired, transfer the origin to zero on the two original
scales by substituting the arithmetic means in the
equations

$$Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X})$$

$$X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y}).$$

8. Compute the two standard errors of estimate from the
formulas

$$S_y = \sigma_y \sqrt{1 - r^2}$$

$$S_x = \sigma_x \sqrt{1 - r^2}.$$


B. Data to be classified. 


1. Construct a correlation table as in I. B. above. 

2. Select an assumed mean for each variable. Measure the 
deviations of the various items from the assumed means 
in class-interval units. 

3. Compute $c_x$ and $c_y$ in class-interval units.

4. Compute $\sigma_x$ and $\sigma_y$ in class-interval units.

5. Compute $\Sigma(x'y')$ in class-interval units for each compartment
of the correlation table. Total these figures to get
$\Sigma(x'y')$ for the whole table.

6. Determine the value of the mean product in class-interval
units from the formula

$$p = \frac{\Sigma(x'y')}{N} - c_x c_y.$$

7. Compute r from the formula

$$r = \frac{p}{\sigma_x \sigma_y}.$$

8. Reduce $\sigma_x$ and $\sigma_y$ to original units.

9. Determine the equations of regression by substituting
the proper values in the formulas

$$y = r \frac{\sigma_y}{\sigma_x} x$$

and

$$x = r \frac{\sigma_x}{\sigma_y} y.$$

10. If desired, transfer the origin to zero on the two original
scales from the formulas

$$Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X})$$

$$X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y}).$$

11. Compute the two standard errors of estimate from the
formulas

$$S_y = \sigma_y \sqrt{1 - r^2}$$

$$S_x = \sigma_x \sqrt{1 - r^2}.$$

It is advisable, in all cases, to construct scatter diagrams 
and to plot the lines of regression thereon. It is generally 
possible to derive from such diagrams a truer idea of the 
relations involved, and of the adequacy of the methods 
employed, than may be obtained from a study of the figures 
alone. 
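The product-moment procedure for individual items (II. A., steps 1 to 8) may be sketched in the same manner. The observations are hypothetical, and the function name `product_moment_summary` and the keys of the returned dictionary are assumptions made for illustration.

```python
import math


def product_moment_summary(xs, ys):
    """Follow steps 1 to 8 of the product-moment method for individual items."""
    n = len(xs)
    # Steps 1 and 2: sums divided by N.
    c_x = sum(xs) / n
    c_y = sum(ys) / n
    # Step 3: mean product.
    p = sum(x * y for x, y in zip(xs, ys)) / n - c_x * c_y
    # Step 4: standard deviations.
    sigma_x = math.sqrt(sum(x * x for x in xs) / n - c_x ** 2)
    sigma_y = math.sqrt(sum(y * y for y in ys) / n - c_y ** 2)
    # Step 5: coefficient of correlation.
    r = p / (sigma_x * sigma_y)
    # Step 6: slopes of the two lines of regression.
    b_yx = r * sigma_y / sigma_x
    b_xy = r * sigma_x / sigma_y
    # Step 7: constants with the origin at zero on the original scales.
    a_yx = c_y - b_yx * c_x
    a_xy = c_x - b_xy * c_y
    # Step 8: standard errors of estimate.
    s_y = sigma_y * math.sqrt(1 - r ** 2)
    s_x = sigma_x * math.sqrt(1 - r ** 2)
    return {"r": r, "b_yx": b_yx, "a_yx": a_yx, "b_xy": b_xy,
            "a_xy": a_xy, "S_y": s_y, "S_x": s_x}


# Hypothetical observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
result = product_moment_summary(xs, ys)
for name, value in result.items():
    print(f"{name} = {value:.4f}")
```

For these figures the slope and constant of the regression of Y on X are .6 and 2.2, the values which a least squares fit of the same data would yield, in keeping with the equivalence of the two methods asserted in the text.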


LIMITATIONS 

A question naturally arises as to the degree of generality 
attaching to the measures of relationship described in the 
preceding pages. Are they limited to certain types of dis- 
tributions, or may they be employed as absolutely general 
and universally valid measures? 

As we have seen, the standard deviation has a precise and
definite meaning with respect to distributions following the
normal law. Having values of the mean and of the standard
deviation, we know, in such cases, the exact percentage
of observations which will fall within any stated limits.
If the distribution departs from the normal type the standard
deviation is still a useful measure, but it cannot be interpreted
in the same exact sense. Bearing this in mind, the
formula

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

may be considered.

When the distribution of the original values of the
dependent variable about their mean is normal and the
distribution about the least squares line is normal, both
$S_y$ and $\sigma_y$ have specific and exact meanings, and it is perfectly
legitimate to compute such a measure as r, based
upon the relation of one to the other. Departures from
normality in either case reduce the significance of this
comparison. But we have seen that the standard deviation
remains a useful measure even though the departure from
the normal type be fairly pronounced, though in the latter
case it lacks the precise significance attaching to it in a
normal distribution. In the same way the standard error of
estimate and the coefficient of correlation may be computed
and utilized, even when all the requirements of normality are
not met. Care must be taken in their interpretation in
such cases, however. It must be clearly recognized that
these measures have their full significance only in cases
where the original distribution of the dependent variable
and the distribution about the least squares line are both
normal, or approximately so.

A simple example may make clear the effect upon the
value of the coefficient of correlation of an extreme departure
from a normal distribution. In the following table are
listed certain selected figures taken from the 1919 Census
of Manufactures, for the State of New York.

Table 93


Wage-Earners in Factories and Value of Products, 1919, in Eleven
Cities in the State of New York


City               Number of wage-earners    Total value of products
                   (in thousands)            (in millions of dollars)
                           (X)                        (Y)

Batavia                     2.2                          9
Beacon                      2.2                         10
Corning                     3.5                         11
Geneva                      2.5                         10
Glens Falls                 2.8                         12
Ithaca                      1.7                         10
Middletown                  2.2                         10
Peekskill                   2.1                         11
Rensselaer                  1.4                         10
Tonawanda                   1.8                         16
New York City             638.8                      5,261

When the first ten of these cities, in the order listed, are
treated as a group, the following values are secured:

$\sigma_y$ = 1.8682
$S_y$ = 1.8669
r = - .034.

(No general significance is to be attached to the above
coefficient of correlation, for the cities were selected for the
purpose of illustrating a particular point.)

The ten points and the line of regression are plotted in
Fig. 74.

When New York City is included in the group, the values
secured for the sample of eleven cities are

$\sigma_y$ = 1509.3
$S_y$ = 7.53
r = + .999988.

The eleven points and the line of regression are plotted in
Fig. 75.

The reason for the markedly different results is obvious.
When the one very large city is included with the ten small
cities the standard deviations of both variables are greatly
increased. That of the Y-variable (value of products) is
increased from 1.8682 to 1509.3. But $S_y$, the measure of
the scatter about the fitted line, undergoes no such pronounced
change in value. For the ten cities it is 1.8669;
for the eleven cities 7.53. This is due to the fact that the
one exceptional case is given such a great weight, in fitting
by the method of least squares, that the fitted line must
pass through or very near the point representing this
observation. Accordingly, S is always affected less than
$\sigma$ by a single very exceptional case. Since the value of r
depends upon the relationship

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

the presence of such a case always tends to increase the
value of the measure of correlation. The introduction of
the one exceptional case in the above example changes a
correlation coefficient of virtually zero to one of unity.
The result, of course, is meaningless.

Fig. 74. Showing the Relation between Number of Wage-Earners in
Factories and Value of Products in Ten Selected Cities in the State of
New York
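The effect of the single exceptional case may be reproduced from the figures of Table 93. The sketch below computes the product-moment coefficient for the first ten cities and again for all eleven; the helper `pearson_r` is an illustrative name, and the deviation form of the formula is used here for numerical convenience rather than because the text prescribes it.

```python
import math


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation, from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Table 93: wage-earners (thousands) and value of products (millions of dollars).
# New York City is the last entry in each list.
wage_earners = [2.2, 2.2, 3.5, 2.5, 2.8, 1.7, 2.2, 2.1, 1.4, 1.8, 638.8]
value_of_products = [9, 10, 11, 10, 12, 10, 10, 11, 10, 16, 5261]

r_ten = pearson_r(wage_earners[:10], value_of_products[:10])
r_eleven = pearson_r(wage_earners, value_of_products)

print(f"ten cities:    r = {r_ten:+.3f}")
print(f"eleven cities: r = {r_eleven:+.6f}")
```

The ten small cities show practically no correlation, while the addition of the one large city carries the coefficient almost to unity, as the text describes.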

While this example represents an extreme instance, the
same distortion will be felt, to a greater or less degree,
whenever there is a departure from a normal distribution.
In practice the various measures of relationship cannot
be restricted to perfectly normal distributions, but they
must be interpreted with care when there is reason to believe
that such disturbing influences are present.

Fig. 75. Showing the Relation between Number of Wage-Earners in
Factories and Value of Products in Eleven Selected Cities in the State of
New York


The Coefficient of Rank Correlation

The coefficient of rank correlation is a measure of relationship
useful in providing an objective test of the existence of
correlation. Its application calls merely for the orderly
ranking of observations. Thus we may rank 47 states of
the union¹ according to the number of individual income
tax returns in 1934, and according to the number of passenger
automobiles registered in that year. The results are
shown in Table 94.


Table 94


Illustrating the Computation of the Coefficient of Rank Correlation


(1)               (2)                    (3)                    (4)           (5)
State             Rank on basis of       Rank on basis of       Difference    d²
                  number of individual   number of passenger    (2) - (3)
                  income tax returns     automobiles registered     d
                  in 1934                in 1934

Nevada                  1                      1                    0             0
Wyoming                 2                      3                  - 1             1
New Mexico              3                      4                  - 1             1
S. Dakota               4                     15                  - 11          121
Idaho                   5                      9                  - 4            16
N. Dakota               6                     12                  - 6            36
Vermont                 7                      5                  + 2             4
Delaware                8                      2                  + 6            36
Arizona                 9                      6                  + 3             9
Utah                   10                      7                  + 3             9
Mississippi            11                     13                  - 2             4
Arkansas               12                     16                  - 4            16
S. Carolina            13                     18                  - 5            25
New Hampshire          14                      8                  + 6            36
Montana                15                     10                  + 5            25
Maine                  16                     14                  + 2             4
Alabama                17                     19                  - 2             4
Nebraska               18                     30                  - 12          144
Oregon                 19                     21                  - 2             4
W. Virginia            20                     17                  + 3             9
Colorado               21                     22                  - 1             1
Rhode Island           22                     11                  + 11          121
N. Carolina            23                     31                  - 8            64
Florida                24                     23                  + 1             1
Kentucky               25                     25                    0             0
Kansas                 26                     33                  - 7            49
Louisiana              27                     20                  + 7            49
Tennessee              28                     26                  + 2             4
Georgia                29                     29                    0             0
Oklahoma               30                     32                  - 2             4
Virginia               31                     28                  + 3             9
Iowa                   32                     35                  - 3             9
Minnesota              33                     36                  - 3             9
Indiana                34                     38                  - 4            16
Maryland               35                     24                  + 11          121
Connecticut            36                     27                  + 9            81
Wisconsin              37                     34                  + 3             9
Missouri               38                     37                  + 1             1
Texas                  39                     42                  - 3             9
Michigan               40                     41                  - 1             1
Ohio                   41                     44                  - 3             9
New Jersey             42                     40                  + 2             4
Massachusetts          43                     39                  + 4            16
Illinois               44                     43                  + 1             1
California             45                     46                  - 1             1
Pennsylvania           46                     45                  + 1             1
New York               47                     47                    0             0

                                                                              1,094


¹ Washington is excluded, because the published income tax returns for that
state include those of Alaska.

Following are the records for the nine states not listed in Table 86.

State              No. of taxable personal     No. of passenger automobiles
                   incomes (in thousands)      registered (in thousands)
                   1934                        1934

California                316                         1,769
Illinois                  310                         1,282
Massachusetts             243                           687
Michigan                  139                         1,026
New Jersey                211                           741
New York                  808                         1,971
Ohio                      210                         1,453
Pennsylvania              342                         1,466
Texas                     119                         1,086


The degree of correlation is indicated by the degree of
concordance between the two rankings. A precise measure
of correlation is provided by the coefficient

$$\rho_r = 1 - \frac{6\Sigma(d^2)}{n^3 - n}$$

where d is a difference between the rankings of a given
state in columns (2) and (3), and n is the number of states
included.¹ (The Greek letter rho ($\rho$) with subscript r is used
as the symbol of this coefficient.)

The method of computation is shown in Table 94. From
the measurements there given we have

$$\rho_r = 1 - \frac{6 \times 1{,}094}{(47)^3 - 47} = 1 - \frac{6{,}564}{103{,}776} = .94.$$
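The computation of Table 94 may be retraced as follows. The list `auto_ranks` gives column (3) of the table in the order in which the states are listed, so that column (2) is simply the sequence 1 to 47; the variable names are illustrative.

```python
# Column (3) of Table 94: rank by passenger automobiles registered, listed in
# the order of column (2), the rank by individual income tax returns.
auto_ranks = [
    1, 3, 4, 15, 9, 12, 5, 2, 6, 7, 13, 16,
    18, 8, 10, 14, 19, 30, 21, 17, 22, 11, 31, 23,
    25, 33, 20, 26, 29, 32, 28, 35, 36, 38, 24, 27,
    34, 37, 42, 41, 44, 40, 39, 43, 46, 45, 47,
]
income_ranks = list(range(1, len(auto_ranks) + 1))

n = len(income_ranks)
differences = [a - b for a, b in zip(income_ranks, auto_ranks)]
sum_d2 = sum(d * d for d in differences)

# Coefficient of rank correlation: 1 - 6 * sum(d^2) / (n^3 - n).
rho = 1 - 6 * sum_d2 / (n ** 3 - n)

print(f"n = {n}, sum of d^2 = {sum_d2}, rho = {rho:.2f}")
```

Only the ranks enter the computation; the actual numbers of returns and of automobiles play no part once the ordering has been established.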

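¹ This formula may be derived from the familiar product-moment formula
for the coefficient of correlation, simplified because of the fact that the sum
of the squares of the deviations of the first n natural numbers from their mean
is equal to $\frac{n^3 - n}{12}$.

If we let d equal the difference between the rank of one variable and the
corresponding rank of the other, we have, for any given pair of observations,

$d = x - y$ (since the means of the two series of ranks are identical),

where x and y are the deviations of the two ranks from their means. Then

$\Sigma d^2 = \Sigma(x - y)^2 = \Sigma x^2 + \Sigma y^2 - 2\Sigma xy$

$2\Sigma xy = \Sigma x^2 + \Sigma y^2 - \Sigma d^2.$

But $\Sigma x^2 = \frac{n^3 - n}{12}$ and $\Sigma y^2 = \frac{n^3 - n}{12}$.

Therefore $2\Sigma xy = \frac{2n^3 - 2n}{12} - \Sigma d^2.$

But $\rho_r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \cdot \Sigma y^2}}$ (the product-moment formula for r)

$= \frac{\frac{1}{2}\left(\frac{n^3 - n}{6} - \Sigma d^2\right)}{\frac{n^3 - n}{12}} = 1 - \frac{6\Sigma d^2}{n^3 - n}.$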
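The identity upon which this derivation rests may be verified numerically. The sketch uses exact rational arithmetic so that no rounding enters the comparison; the function name `sum_of_squared_rank_deviations` is illustrative.

```python
from fractions import Fraction


def sum_of_squared_rank_deviations(n):
    """Sum of squared deviations of the ranks 1..n from their mean, (n + 1)/2."""
    mean = Fraction(n + 1, 2)
    return sum((k - mean) ** 2 for k in range(1, n + 1))


# Compare with (n^3 - n)/12 for a range of n, including the 47 states of Table 94.
checks = {n: sum_of_squared_rank_deviations(n) == Fraction(n ** 3 - n, 12)
          for n in range(1, 61)}

print(all(checks.values()))
print(sum_of_squared_rank_deviations(47))
```

For n = 47 the sum is 8,648, which is 103,776 divided by 12, the same denominator that appears, multiplied by 12, in the computation of the coefficient.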





The coefficient of rank correlation is appropriate where
it is possible to rank individuals, or other entities, on the
basis of abilities or qualities not open to exact measurement.
It is also well adapted for use where the distributions of
the observations depart widely from the normal type, and
where the usefulness of customary measurements would
be seriously impaired. This point takes on particular importance
in connection with tests of significance, involving
generalizations from sample results. Such tests are discussed
in later chapters.


CHAPTER XI


THE MEASUREMENT OF RELATIONSHIP
BETWEEN TIME SERIES

The methods of measuring correlation described in the
preceding chapter were devised originally for the analysis
of non-historical data, that is, for the treatment of frequency
series rather than time series. The measurement of correlation
between series in time presents certain distinctive
problems which require separate treatment.

We have seen that such series are affected by various
forces, which have been classified as the secular trend,
cyclical and seasonal fluctuations and accidental variations,
and methods have been described by means of which the
effects of these various forces may be isolated. This breaking
up of a series into its component parts for separate study
is essential in attempting to correlate series in time, for
spurious and quite misleading results will be secured if
this is not done. The problem of correlation is that of
securing a precise measure of the degree of relationship
between variable quantities. But each series in time represents
the combination of a number of variables and, so
far as possible, each should be treated separately in correlating
such series.

The relationship between two time series as, for example, 
interest rates and bond prices, may be studied with respect 
to any or all of the following components: 

a. Secular trend. 

h. Cyclical fluctuations. 

c. Seasonal fluctuations. 

d. Changes from one time unit to the next (e.g., week to week, 
month to month, or year to year). 
Such relationships may be studied, first, through the 
comparison of graphs, and much may be learned by this 
simple process. The similarity or dissimilarity of secular 
trends, and the general relation between cyclical movements 
may be determined by a study of such graphs. For more 
accurate comparison the coefficient of correlation may be 
used, but when it is so employed it is particularly impor- 
tant that the precise nature of its employment and the 
exact significance of the results be understood. 

For the comparison of secular trends the coefficient of 
correlation would never be employed. The mere fact that 
two series have the same secular trend is no indication 
of a relationship of interdependence; a coefficient of correla- 
tion based upon the trend values would be meaningless. 
Moreover, much simpler methods are available for comparing 
trends. 

For the same reason a coefficient of correlation should 
not be based upon the original absolute values of two 
series in time, except in the rather rare case in which neither 
series is marked by a definite secular trend. The computa- 
tion of r, when dealing with ordinary statistical data, 
involves measuring the deviations of all the items from 
their respective arithmetic means, and securing the sum 
of the products of the paired deviations. When deviations 
of like sign are paired throughout r will have a positive 
value; when deviations of unlike signs are paired throughout 
r will have a high negative value. The presence of pro- 
nounced rising or declining secular trends makes it impossi- 
ble to secure significant values for r by the employment of 
this method. For example, the relation between automobile 
production and the price of bacon between the years 1900 
and 1920 might be measured. The secular trend is markedly 
rising in each case. When the deviations of the annual 
figures are measured from the arithmetic means of the 
two series, the paired items for the earlier years will be 
negative, for the later years positive. A fairly high positive 
value for r would be secured, were the computation carried 
through on this basis. This value would be quite misleading, 
for no real relationship can be expected in this case. The 
coefficient of correlation in such a case would measure, 
primarily, the relation between the two secular trends. 
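
The point may be illustrated by a minimal sketch in Python. The two series below are synthetic and purely hypothetical (they are not the automobile and bacon figures, which are not given here); each is a rising straight line plus an unrelated fluctuation. The function names are illustrative only, and a straight-line trend is assumed for simplicity.

```python
import numpy as np

# Synthetic, purely illustrative data (not taken from the text): two series
# that share nothing but a rising straight-line trend over 20 years.
t = np.arange(20)
alternating = np.where(t % 2 == 0, 1.0, -1.0)         # +1, -1, +1, ...
two_up_two_down = np.where((t % 4) < 2, 1.0, -1.0)    # +1, +1, -1, -1, ...
series_a = 10.0 + 2.0 * t + 3.0 * alternating
series_b = 50.0 + 5.0 * t + 4.0 * two_up_two_down


def pearson_r(u, v):
    """Product-moment coefficient from deviations about the arithmetic means."""
    du, dv = u - u.mean(), v - v.mean()
    return float((du * dv).sum() / np.sqrt((du**2).sum() * (dv**2).sum()))


def deviations_from_linear_trend(values, time):
    """Residuals from a least-squares straight line fitted to the series."""
    slope, intercept = np.polyfit(time, values, 1)
    return values - (intercept + slope * time)


r_raw = pearson_r(series_a, series_b)
r_detrended = pearson_r(deviations_from_linear_trend(series_a, t),
                        deviations_from_linear_trend(series_b, t))
print(f"r on original values:       {r_raw:+.3f}")
print(f"r on deviations from trend: {r_detrended:+.3f}")
```

The first coefficient is high only because both series rise; once each series is measured from its own line of trend the apparent relationship largely disappears.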

This coefficient might conceivably be employed to deter- 
mine the similarity between seasonal fluctuations in two 
series, but its utility for this purpose may be questioned. 
Here again other and simpler methods are available. 

In practice, therefore, the device of correlation should 
be employed neither to measure the relation between secular 
trends nor between seasonal movements. Its use is confined 
to comparisons of two or more series with respect to cyclical 
fluctuations and with respect to the short time changes 
from month to month or year to year. And, if valid measures 
of correlation are to be secured in making such comparisons, 
the effects of forces which distort these comparisons should 
be eliminated, in so far as this is possible. The actual work 
of correlation must be preceded by a sifting process designed 
to remove such irrelevant material. Unless the data are 
thus “distilled” the interpretation of the resulting coeffi- 
cients will be difficult. 

The Measurement of Correlation between Cyclical 
Fluctuations 

In an earlier chapter we have dealt with methods by 
which the effects of certain of the factors affecting time 
series might be measured and eliminated. The spurious 
correlation due to secular trend may be avoided by measur- 
ing the deviations of the observations not from the respec- 
tive arithmetic averages but from the lines of secular trend 
of the two series. These variations, the deviations from 
trend, are the significant values if our interest centers in the 
cycles. If annual values are employed the problem of elim- 
inating seasonal fluctuations is not faced. 

To illustrate this method of measuring the relationship 
between series in time we may undertake to determine 
whether there is any connection between cyclical fluctuations 
in cotton production and in cotton prices. Figures for 
crop years are to be employed, for the period 1901-02 to 
1935-36. 

Cotton prices require some correction before correlation 
is attempted. The raw figures with which the investigation 



Fig. 76. Cotton Production in the United States, Crop Years 1901-1902 
to 1935-1936, with Line of Trend 


starts are average spot prices at New York for middling 
upland cotton, at wholesale, from September to May of 
each crop year. But such prices reflect not only the effects 
of varying conditions in the cotton market, but also changes 
in the general level of prices. To eliminate the effect of 
this factor the original prices are deflated by Bradstreet’s 
price index, as computed for the September-May period in 
each crop year. For this purpose, Bradstreet's index has 
been reduced to relative terms, with the average for the 




Table 95 

Cotton Production and Cotton Prices, 1901-1936 

(1) Crop year 
(2) Cotton production in United States, excluding linters (in thousands of bales) 
(3) Cotton prices: average of spot prices in N. Y. for middling upland cotton, 
    Sept. to May (in cents per pound) 
(4) Bradstreet's price index, average, Sept. to May (1913-14 = 100) 
(5) Cotton prices, deflated (in cents per pound) 

  (1)         (2)       (3)       (4)       (5) 

1901-02      9,510      8.64      86.2     10.02 
1902-03     10,631      9.50      90.0     10.56 
1903-04      9,851     13.20      88.6     14.90 
1904-05     13,438      8.69      89.3      9.73 
1905-06     10,575     11.40      92.3     12.35 
1906-07     13,274     10.97      98.8     11.10 
1907-08     11,107     11.41      93.2     12.24 
1908-09     13,242      9.81      91.3     10.74 
1909-10     10,005     14.62     100.6     14.53 
1910-11     11,609     14.80      97.8     15.13 
1911-12     15,693     10.34     100.0     10.34 
1912-13     13,703     12.35     104.8     11.78 
1913-14     14,166     13.40     100.0     13.40 
1914-15     16,135      8.63     105.2      8.20 
1915-16     11,192     12.04     121.2      9.93 
1916-17     11,450     18.29     151.0     12.11 
1917-18     11,302     29.96     197.9     15.14 
1918-19     12,041     30.06     203.1     14.80 
1919-20     11,421     38.63     226.3     17.07 
1920-21     13,440     16.90     152.9     11.05 
1921-22      7,954     18.67     127.2     14.68 
1922-23      9,762     26.26     149.7     17.54 
1923-24     10,140     31.79     145.3     21.88 
1924-25     13,628     24.34     150.4     16.18 
1925-26     16,104     20.60     153.8     13.39 
1926-27     17,977     14.26     141.0     10.11 
1927-28     12,956     20.19     149.2     13.53 
1928-29     14,478     20.02     145.1     13.80 
1929-30     14,825     17.00     132.1     12.87 
1930-31     13,932     10.47     107.6      9.74 
1931-32     17,096      6.42      86.1      7.46 
1932-33     13,002      6.75      76.2      8.85 
1933-34     13,047     10.95     100.6     10.89 
1934-35      9,636     12.42     106.7     11.64 
1935-36     10,443     11.59     112.6     10.29 
crop year 1913-14 equal to 100. The original figures for 
the two series to be correlated, together with the corrected 
price figures, are given in Table 95. 
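
The deflation just described is a simple division of each price by the index expressed as a ratio to its base. A short sketch, using three rows copied from Table 95, shows the arithmetic; the function name `deflate` is merely illustrative.

```python
# Three rows copied from Table 95: (crop year, spot price in cents per pound,
# Bradstreet's index with 1913-14 = 100).
rows = [
    ("1901-02", 8.64, 86.2),
    ("1918-19", 30.06, 203.1),
    ("1935-36", 11.59, 112.6),
]


def deflate(price, index, base=100.0):
    """Express a price in terms of the price level of the base period."""
    return price / (index / base)


deflated = {year: round(deflate(price, index), 2) for year, price, index in rows}
for year, value in deflated.items():
    print(year, value)
```

The results agree with column (5) of Table 95 for the same crop years.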

These data are plotted in Figs. 76 and 77. Lines of trend 
fitted to the two series are shown on the charts.¹ 



Fig. 77. Prices of Middling Upland Cotton in New York, Crop Years 
1901-1902 to 1935-1936, with Line of Trend. (Figures relate to average 
annual prices, during crop years, deflated by Bradstreet's index of whole- 
sale prices) 


The deviation of each annual item from the secular trend 
of the given series is now to be measured, and the coefficient 
of correlation between these deviations is to be calculated. 
The computations appear in Table 96. 
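
As a rough check on the procedure, the deviation of a given year from trend may be computed directly. The sketch below assumes the third-degree equation for cotton production printed in the footnote to this passage (origin at 1918-19) and takes three production figures from Table 95; the function name is hypothetical.

```python
def production_trend(x):
    """Cubic trend of cotton production, in thousands of bales.

    x is the number of crop years measured from the origin at 1918-19; the
    coefficients are those printed in the footnote to this passage.
    """
    return 13009.14 + 87.96 * x - 4.640 * x**2 - 0.1491 * x**3


# (crop year, x measured from 1918-19, production in thousands of bales)
observations = [
    ("1901-02", -17, 9510),
    ("1918-19", 0, 12041),
    ("1926-27", 8, 17977),
]

deviations = {year: round(actual - production_trend(x))
              for year, x, actual in observations}
print(deviations)
```

These three deviations agree with the corresponding entries in column (2) of Table 96.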

This value of -.648 for the coefficient indicates a fair 
degree of negative correlation between deviations of cotton 
production in the United States from the line of trend and 

¹ The equation to the line of trend of cotton production is 

$$Y = 13{,}009.14 + 87.96X - 4.640X^2 - .1491X^3,$$ with origin at 1918-19. 

The trend equation for deflated cotton prices is 

$$Y = 13.96 + .152X - .01425X^2 - .00083X^3,$$ with origin at 1918-19. 




Table 96 

Computation of Coefficient of Correlation, Cotton Production and 
Cotton Prices 

(1) Crop year 
(2) Deviation of cotton production from trend (in 1,000's of bales), x 
(3) Deviation of deflated cotton prices from trend (in cents per lb.), y 
(4) x² 
(5) y² 
(6) xy 

  (1)        (2)       (3)          (4)          (5)           (6) 
              x         y           x²           y²            xy 

1901-02    - 1,395    - 1.32      1,946,025     1.7424    +  1,841.40 
1902-03    -   393    -  .72        154,449      .5184    +    282.96 
1903-04    - 1,298    + 3.63      1,684,804    13.1769    -  4,711.74 
1904-05    + 2,161    - 1.59      4,669,921     2.5281    -  3,435.99 
1905-06    -   834    +  .95        695,556      .9025    -    792.30 
1906-07    + 1,731    -  .42      2,996,361      .1764    -    727.02 
1907-08    -   572    +  .57        327,184      .3249    -    326.04 
1908-09    + 1,428    - 1.11      2,039,184     1.2321    -  1,585.08 
1909-10    - 1,945    + 2.49      3,783,025     6.2001    -  4,843.05 
1910-11    -   476    + 2.87        226,576     8.2369    -  1,366.12 
1911-12    + 3,476    - 2.14     12,082,576     4.5796    -  7,438.64 
1912-13    + 1,357    -  .93      1,841,449      .8649    -  1,262.01 
1913-14    + 1,648    +  .45      2,715,904      .2025    +    741.60 
1914-15    + 3,542    - 4.98     12,545,764    24.8004    - 17,639.16 
1915-16    - 1,516    - 3.47      2,298,256    12.0409    +  5,260.52 
1916-17    - 1,366    - 1.50      1,865,956     2.2500    +  2,049.00 
1917-18    - 1,615    + 1.35      2,608,225     1.8225    -  2,180.25 
1918-19    -   968    +  .84        937,024      .7056    -    813.12 
1919-20    - 1,671    + 2.97      2,792,241     8.8209    -  4,962.87 
1920-21    +   275    - 3.15         75,625     9.9225    -    866.25 
1921-22    - 5,273    +  .41     27,804,529      .1681    -  2,161.93 
1922-23    - 3,515    + 3.25     12,355,225    10.5625    - 11,423.75 
1923-24    - 3,174    + 7.62     10,074,276    58.0644    - 24,185.88 
1924-25    +   290    + 2.00         84,100     4.0000    +    580.00 
1925-26    + 2,758    -  .65      7,606,564      .4225    -  1,792.70 
1926-27    + 4,637    - 3.73     21,501,769    13.9129    - 17,296.01 
1927-28    -   360    -  .04        129,600      .0016    +     14.40 
1928-29    + 1,202    +  .58      1,444,804      .3364    +    697.16 
1929-30    + 1,608    +  .07      2,585,664      .0049    +    112.56 
1930-31    +   793    - 2.56        628,849     6.5536    -  2,030.08 
1931-32    + 4,055    - 4.24     16,443,025    17.9776    - 17,193.20 
1932-33    +    80    - 2.17          6,400     4.7089    -    173.60 
1933-34    +   265    +  .66         70,225      .4356    +    174.90 
1934-35    - 2,982    + 2.30      8,892,324     5.2900    -  6,858.60 
1935-36    - 1,988    + 1.94      3,952,144     3.7636    -  3,856.72 

Totals                          171,865,603   227.2611    - 128,167.61 

$$\sigma_x = \sqrt{\frac{171{,}865{,}603}{35}} = 2{,}216.0$$

$$\sigma_y = \sqrt{\frac{227.2611}{35}} = 2.548$$

$$r = \frac{\Sigma xy}{N \sigma_x \sigma_y} = \frac{-128{,}167.61}{35 \times 2{,}216.0 \times 2.548} = -.648.$$



the corresponding deviations of cotton prices in New York, 
during the period covered. 

From the values already computed we may derive an 
equation for estimating the variation in cotton price associ- 
ated with a given variation in production. This regression 
equation, as we have seen, is of the type 


$$y = r \, \frac{\sigma_y}{\sigma_x} \, x.$$



In the present case y and x refer to deviations from the 
parabolic lines of trend. Substituting the given values, we 
have 


$$y = -.648 \, \frac{2.548}{2{,}216} \, x$$

$$y = -.00074x.$$

This equation means that, on the average, a unit devia- 
tion of cotton production (x) above the line of trend was 
accompanied by a deviation of .00074 units in cotton prices 
(y) below the line of trend. The unit employed in the 
production figures was 1,000 bales, in the deflated price 
figures, one cent. In the interpretation of the equation 
it may be simpler to use an x-unit of one million bales, 
making the equation of regression 


$$y = -.74x.$$

Thus a cotton crop one million bales above trend was 
accompanied by prices about three quarters of a cent per 
pound below trend (with reference always to deflated 
prices). This was the average relationship during the 
period 1901-1936. It did not hold in all cases, as is shown 
by the fact that r has a value of but — .648. If this, or 
a similar law, held perfectly, r would have a value of -1. 

The value of S, which measures the scatter about the 
line of regression, may be computed from the formula 

$$S_y = \sigma_y \sqrt{1 - r^2}.$$

In the present case, Sy has a value of 1.94 cents. The 
significance of this measure has been explained in an earlier 
section. 
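
The several measures derived above follow from the three column totals of Table 96, and the arithmetic may be retraced with a short sketch. The variable names are illustrative; carrying unrounded intermediate values changes the last decimal place slightly as compared with the printed figures.

```python
import math

# Column totals from Table 96 (deviations from the two lines of trend).
n = 35
sum_x2 = 171_865_603.0    # squared production deviations (thousands of bales)
sum_y2 = 227.2611         # squared price deviations (cents per pound)
sum_xy = -128_167.61      # sum of the products of paired deviations

sigma_x = math.sqrt(sum_x2 / n)
sigma_y = math.sqrt(sum_y2 / n)
r = sum_xy / (n * sigma_x * sigma_y)
slope = r * sigma_y / sigma_x            # regression of y on x, per 1,000 bales
s_y = sigma_y * math.sqrt(1.0 - r**2)    # scatter about the line of regression

print(f"sigma_x = {sigma_x:.1f}, sigma_y = {sigma_y:.3f}")
print(f"r = {r:.4f}, slope = {slope:.6f}, S_y = {s_y:.2f}")
```

Multiplying the slope by 1,000 gives the coefficient for an x-unit of one million bales, about three quarters of a cent per pound.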

(It should be emphasized that the use of the above 
equation for estimating future prices is dependent upon 
the validity of projecting the two lines of secular trend.) 

In the preceding analysis deviations were measured in 
absolute units, and the results could be interpreted only 
in terms of absolute units, bales of cotton and cents per 
pound. For certain purposes it might have been more 
convenient to correlate percentage deviations from the two 
lines of trend, in which case the standard deviations and 
the equation of regression would have been expressed in 
these terms. The procedure, in this respect, will depend 
in part upon the use to which the results are to be put. 
The nature of the data will also affect a decision on this 
point. The use of percentage rather than absolute deviations 
would be desirable in handling series in which the range 
of absolute deviations had changed materially during the 
period covered. 

It is obvious that in the above problem there is an 
arbitrary element which was not present in the correlation 
problems previously studied. The deviations are measured 
from lines of trend, not from the arithmetic means, and 
these lines of trend are arbitrarily selected. The use of 
different lines of trend might give quite different results. 
In the above example the lines of trend were both power 
curves of the third degree. We might, perhaps with equal 
reason, assume that the underlying trends are best defined 
by other functions. Coefficients of regression and correlation 
would have different values if this were done. The presence 
of this arbitrary element in the correlation of deviations 
from lines of secular trend detracts somewhat from the 
confidence that may be placed in the results. The critical 
problem here lies not in the mechanical process of correla- 
tion, but in the choice of an appropriate line of trend for 
each series. If, by the tests of inspection and of corre- 
spondence with such external evidence as may be available, 
it appears that the curve selected accurately represents the 
trend in each of the series correlated, the coefficient may 
be accepted as significant. But, in the interpretation and 
use of the results, the presence of this element of personal 
judgment in the preliminary calculations must not be 
forgotten. This applies with particular force if the study 
aims to establish a functional relationship between cyclical 
fluctuations in the two series, and if an estimating (or 
regression) equation is to be based upon the results. 

The Coefficient of Correlation and the 
Measurement of Time Sequence 

In the correlation of cotton production and cotton prices 
the object was to measure as accurately as possible the 
effect of variations in cotton production upon cotton prices. 
An equation was secured which described this relation 
when deviations were measured from the particular lines 
of trend employed. Cotton prices were considered to be a 
function of cotton production, and the object of the study 
was to measure this functional relationship. We seek, 
in such cases, to determine the degree to which cycles 
in one series depend upon or reflect cycles in a related 
series, assuming some functional relationship between them. 
This is essentially the problem described in introducing the 
subject of correlation, and generally constitutes the major 
problem in studying the relation between series of any type. 

But a second and somewhat different problem may be 
faced in certain studies of time series. Assuming that 
two such series are marked by definite cycles, it is of 
interest to determine whether the cycles coincide in time, 
or whether cycles in one series consistently precede or lag 
behind cycles in the other. The coefficient of correlation 
has been found very useful in determining the degree of 
“lead” or “lag” in such cases. This problem is that of 
determining merely temporal relationship, as opposed to 
the functional relationship that is ordinarily to be measured. 

THE RELATION BETWEEN STOCK PRICE CYCLES AND CYCLES 
OF BUSINESS ACTIVITY 

To illustrate the solution of a problem of this latter type, 
we may undertake to determine the relation, in time, 
between cyclical movements in industrial stock prices and 
in general business activity, as measured by the composite 
index compiled by the American Telephone and Telegraph 
Company. The monthly values of this index for the period 
1899-1937 have been presented in an earlier section. Figures 
relating to stock prices from January, 1903, to June, 1914, 
are given in Table 97. 

Table 97 

Cycles in Industrial Stock Prices, 1903-1914 ¹ 

(Figures relate to deviations from trend in units of the standard deviation) 

Month 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 

jamuT 1 .(.2 3+18-18 + 6+10-1-4 — 2-7 

Vetmuy — .1—21+ 2+ 2. 2 +14— 16 + 3 + 5 + 1 — 4— 6—6 

March - .8-21+ 6 +1 9 + ,6 -1.1 + .8+ 8-.1-1-.7-6 

April -.6 - 20 --.7 +17 + 6-. 8+ 6 + 5--1+.8- 6-8 

May - .6 -2.1 + 2 +1.4 +.4~.6+8 + .4 0+.2- 7-8 

June -.9-21+ 3+16 + 2 - 6+.9+ 1 + 1 + 2-11-.7 

July -1.4 -1.8 + .7 +1.3 + 3 - .2 +1.1 - .4 + 1 + .2 - .9 

August -1.7 —1.6 -- .8 +1.7 — 3 + .3 +1.4 — 8—8 + 8— 7 

Septembor - 1.9 - l.S -- .7+17- .5 + .1 +1 4 - .8 - 7 + .4 - ;6 
October - 2.3 — .9 -- .8 + 1.7 - 1.8 + 2 + 1 4 0 — ,7 + .8 — 8 

November —2.4 — .8 +1 1 -fl.T — 1.9 + .6 +1 4 + 1 — 5 + 2 — *9 

Deoerober -2,1 - .1 + 1 8 + 1.7 - 1 6 + .6 + 1 8 - '.2-4 'o - *9 

¹ These figures, the results of analyses by W. M. Persons, are from the 
Review of Economic Statistics, published by the Harvard Committee on Eco- 
nomic Research. They are based upon the average price of 12 industrial stocks. 
The data of the two series are plotted in Fig. 78.¹ From 
a comparison of the two curves in this chart it is clear 
that there is some relation between the movements in 


Fig. 78. Comparison of Cyclical Fluctuations in Industrial Stock Prices 
and in General Business Activity, 1903-1914. (Separate vertical scales 
for the index of industrial stock prices and the index of general business) 

the two series, but such a comparison affords no basis 
for a definite conclusion. Our object is to determine whether 
the cycles in the two series are exactly synchronous and, 
if they are not, to measure the average time interval by 
which cycles in one series precede the cycles in another. 
The significance of such studies in the analysis of the business 
cycle is obvious. 

For the study of pre-war relations data for the period 
from January, 1903, to June, 1914, may be employed. 
A coefficient of correlation is first computed for concur- 
rent items. A value of + .55 is secured. Next, the data 
are correlated with industrial stock prices preceding general 

¹ The American Telephone and Telegraph index here plotted is not identical 
with that given in Chapter IX. The latter is a revised series, differing in some 
respects from the original index for the pre-war period that has been used in 
the present calculations. 
business by one month. That is, the January, 1903, figure 
for stock prices is multiplied by the February, 1903, index 
of general business; the February stock price is multiplied 
by the March business index, etc. This process is carried 
through for the entire period from January, 1903, to June, 
1914. Only 137 monthly values are used in this computa- 
tion, as compared with 138 in the preceding case, for the 
January, 1903, business index and the June, 1914, stock 
price figure do not enter into the calculations. Accordingly, 
the values c_x and c_y (the two corrections to be applied 
because the origin does not coincide with the two averages) 
and the two standard deviations will be slightly different. 
These corrections may be readily made. The coefficient 
of correlation secured from these computations has a value 
of + .65. The same operation is repeated with other 
pairings of the two variables. The results are summarized 
below. 


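Table 98 

Coefficients of Correlation between Industrial Stock Prices and an 
Index of General Business Activity 
(Based upon data for the period 1903-1914) 

                                                      Coefficient of correlation 

Stock prices concurrent with business index                    + .55 
Stock prices preceding business index by  1 month              + .65 
Stock prices preceding business index by  2 months             + .70 
Stock prices preceding business index by  3 months             (illegible) 
Stock prices preceding business index by  4 months             + .76 
Stock prices preceding business index by  5 months             + .76 
Stock prices preceding business index by  6 months             + .76 
Stock prices preceding business index by  7 months             (illegible) 
Stock prices preceding business index by  8 months             (illegible) 
Stock prices preceding business index by  9 months             (illegible) 
Stock prices preceding business index by 10 months             (illegible) 
Stock prices preceding business index by 11 months             (illegible) 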


These figures are plotted in Fig. 79. 

The coefficients increase to a maximum value of + .76 
which is secured with stock prices preceding general business 
by 4, 5, and 6 months. The stability of the coefficients 
with the period of “lead” varying from 3 to 7 months 
indicates that there was no one specific interval, within 
the limits thus indicated, between the cyclical movements 
of these two series. From the results here given it would 



Fig. 79. Coefficients of Correlation between Index of Industrial Stock 
Prices and Index of Business Activity, 1903-1914, Showing the Results 
Secured with Different Pairings. (Horizontal scale: lag in months. In all 
pairings except that of concurrent items the business activity index 
follows the stock price index) 

appear that five months was the average interval by which 
stock prices preceded the general business index, but this 
was not sharply marked off as a constant relationship. 
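
The successive pairings described above may be sketched as follows. The two series here are synthetic and hypothetical (the monthly figures of Table 97 and of the business index are not reproduced); the second series is constructed to repeat the first three months later, so that the highest coefficient should appear at a lead of three months. The function name `lagged_correlation` is an illustrative choice.

```python
import numpy as np


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag]; returns (r, pairs used)."""
    if lag < 0:
        raise ValueError("lag must be zero or positive")
    n = len(leader) - lag
    a = np.asarray(leader, dtype=float)[:n]
    b = np.asarray(follower, dtype=float)[lag:lag + n]
    return float(np.corrcoef(a, b)[0, 1]), n


# Synthetic illustration only: 138 monthly values (the length of the
# 1903-1914 span), the second series repeating the first three months later.
months = np.arange(138)
stock_like = np.sin(2 * np.pi * months / 24)
business_like = np.sin(2 * np.pi * (months - 3) / 24)

results = {lag: lagged_correlation(stock_like, business_like, lag)
           for lag in range(0, 12)}
best_lag = max(results, key=lambda lag: results[lag][0])
for lag, (r, n) in results.items():
    print(f"lead of {lag:2d} months: r = {r:+.3f} from {n} pairs")
print("highest coefficient at lead =", best_lag)
```

Each additional month of lead removes one pair from the computation, which is why 137 values rather than 138 enter the second pairing. `np.corrcoef` measures deviations from the means of the items actually paired, so the corrections for the changed averages are made automatically.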

With this record of pre-war relations we may contrast the 
experience of recent years. The Index of Industrial Activity 
of the American Telephone and Telegraph Company, given 
in Chapter IX, defines the state of business. Of stock 
price index numbers, the measurements currently published 
in the Review of Economic Statistics¹ are in a form best 

¹ This is not a homogeneous series for the entire period covered. For the 
years 1919-1924 the index is based on the average price of 20 industrial stocks 
(the Dow-Jones index), expressed as deviations from trend in units of the 
standard deviation. For the period 1925-1937 the official all-inclusive index 
of the New York Stock Exchange (index No. 2) has been used. This index, 

(Footnote 1 continued below) 
adapted to our present needs, although a change in coverage 
during the period detracts somewhat from their utility for 
comparative purposes. Monthly values of this index for 
the period 1919-1937 are recorded in Table 99. The two 
series are plotted in Fig. 80. 

Table 99 

Cycles in Stock Prices, 1919-1937 ² 

Month    1919   1920   1921   1922   1923   1924   1925   1926   1927   1928 

Jan.     +.38  +1.44   -.69   -.68   -.12   -.47   +.15  +1.11  +1.04  +2.51 
Feb.     +.38   +.78   -.70   -.63   +.06   -.43   +.17   +.90  +1.31  +2.31 
March    +.70  +1.06   -.72   -.38   +.16   -.58   -.21   +.33  +1.25  +2.96 
April    +.90  +1.10   -.68   -.18   -.02   -.81   -.12   +.52  +1.32  +3.37 
May     +1.40   +.60   -.67   -.11   -.31   -.88   +.19   +.66  +1.56  +3.38 
June    +1.76   +.46  -1.16   -.14   -.46   -.82   +.27   +.81  +1.39  +2.82 
July    +2.01   +.38  -1.20   -.08   -.70   -.58   +.41  +1.00  +1.94  +2.89 
Aug.    +1.50   +.04  -1.32   +.06   -.66   -.39   +.43  +1.11  +2.07  +3.37 
Sept.   +1.78   +.11  -1.16   +.10   -.70   -.44   +.53  +1.13  +2.42  +3.63 
Oct.    +2.14   -.04  -1.12   +.10   -.84   -.60  +1.01   +.89  +2.06  +3.69 
Nov.    +1.90   -.44   -.89   -.16   -.72   -.25   +.90  +1.07  +2.56  +4.39 
Dec.    +1.64   -.85   -.70   -.10   -.60   -.02  +1.05  +1.06  +2.67  +4.41 

Month    1929   1930   1931   1932   1933   1934   1935   1936   1937 

Jan.    +4.21  +1.11  -1.36  -4.02  -4.31  -2.83  -3.30  -1.6?   -.64 
Feb.    +4.11  +1.28   -.83  -3.90  -4.66  -2.90  -3.37  -1.61   -.61 
March   +3.70  +1.83  -1.22  -4.20  -4.62  -2.89  -3.61  -1.61   -.66 
April   +3.84  +1.69  -1.73  -4.64  -3.91  -2.93  -3.23  -1.92  -1.11 
May     +3.17  +1.41  -2.35  -6.05  -3.33  -3.19  -3.14  -1.71  -1.17 
June    +3.88   +.13  -1.86  -5.09  -2.90  -3.13  -2.97  -1.61  -1.44 
July    +4.26   +.26  -2.16  -4.60  -3.26  -3.61  -2.71  -1.31  -1.03 
Aug.    +4.89   +.24  -2.18  -3.86  -2.89  -3.37  -2.62  -1.27  -1.31 
Sept.   +4.10   -.51  -3.42  -3.97  -3.31  -3.40  -2.66  -1.28  -2.04 
Oct.    +1.67  -1.04  -3.23  -4.30  -3.67  -3.46  -2.29   -.90  -2.48 
Nov.     +.68  -1.21  -3.64  -4.42  -3.82  -3.21  -2.10   -.78  -2.85 
Dec.     +.74  -1.65  -3.99  -4.37  -3.27  -3.21  -1.93   -.81  -3.03 


Results obtained from a study of the temporal relations 
between these two series, for the period 1919-1937, are 
given in Table 100. 

(Footnote 1, continued) 

originally constructed with the figure for Jan. 1925 as 100, has here been 
expressed in terms of deviations from 100, in units of a standard deviation 
assumed to be equal to 15 on the original scale. In effect, a horizontal trend 
at the level of Jan. 1, 1925, has been assumed for the Stock Exchange index. 
This index has also been shifted slightly in time. The index figure relating 
to the first day of a given month, in the Stock Exchange tabulations, has here 
been recorded as for the month preceding. Thus a February 1st index is en- 
tered for January, a March 1st index for February, etc. 

² From the Review of Economic Statistics. The figures in the table define 
deviations from trend, in units of the standard deviation, with the assumptions 
stated in the preceding footnote. The coefficients in Table 100 are based upon 
data through July, 1937, only. 
Table 100 

Coefficients of Correlation between Stock Prices and an Index of 
Business Activity 

(Based upon data for the period 1919-1937) 

                                                      Coefficient of correlation 

Stock prices concurrent with business index                    + .85 
Stock prices preceding business index by 1 month               + .86 
These measurements are shown graphically in Fig. 81. 

In using these coefficients we should note that the stock 
price records for part of the recent period are different 
in important respects from those employed for the pre-war 
period. In place of the 12 industrial stocks entering into 
the earlier comparisons the index for the recent period 
included 20 stocks and, later, a comprehensive list composed 
of all varieties of stocks. The market behavior of the 
broader list may have departed somewhat from the pattern 
set by the limited number of industrial stocks. The differ- 
ence between the results for the two periods is to be inter- 
preted with this fact in mind. 

In post-war years the highest degree of correlation pre- 
vailed with the business index following the stock price 
index by one month. The traditional “lead” of stock 
prices, on the basis of which the movements of these prices 
have been used as forecasters of business changes, was 
clearly reduced in this period. The actual statistical record 
we have obtained may have been affected somewhat by 
the broadening of the coverage of the stock price index 
used, but the change in the relations between the two 
series appears to have been a real one. 

This method of measuring temporal relations between 
economic series is highly useful, but one important caution 
should be noted. The method indicates the average degree 
of lead or lag of one series, with reference to another. 
Frequently the sequences of change in economic series are 
not the same in all phases of business cycles. Thus, observa- 
tions relating to ten business cycles occurring between 1890 



Fig. 81. Coefficients of Correlation between Index of Industrial Stock 
Prices and Index of Business Activity, 1919-1937, Showing the Results 
Secured with Different Pairings. (Horizontal scale: lag in months. In all 
pairings except that of concurrent items the business activity index 
follows the stock price index) 

and 1925 indicate that pig iron prices preceded the general 
index of wholesale prices by 3.4 months, on the average, 
in business recessions, but followed the general index by 
5.1 months, on the average, in periods of business revival.¹ 
This highly important difference would be ironed out in 
the measurement of average temporal relations by the 

¹ Cf. The Behavior of Prices, New York, National Bureau of Economic 
Research, 1927, 84-87. 
correlation method. The use of an average should be 
supplemented by a study of the items entering into the 
average. Study of the relations among individual observa- 
tions at different cyclical phases is essential when correlation 
technique is employed to define time sequences among the 
movements of economic series. 

The Use of the Moving Average in Correlating 
Cycles in Time Series 

The preceding discussion has dealt only with cycles as 
measured from mathematically fitted lines of trend. But 
trend may be measured, as we have seen, by lines based 
upon moving averages, and the cyclical deviations from 
such lines may be correlated in precisely the same way as 
deviations from other lines of trend. The arithmetic 
mean of the deviations from such moving averages will 
not necessarily be zero, as in the case of deviations measured 
from lines fitted by the method of least squares, and a corre- 
sponding correction must be made in correlating such figures. 
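
A brief sketch may make the last point concrete. It assumes a centered moving average of an odd number of terms and applies it to a synthetic, smoothly accelerating series (the squares of the whole numbers); the function name is hypothetical.

```python
import numpy as np


def deviations_from_moving_average(values, window):
    """Deviations from a centered moving average of odd length `window`.

    The first and last (window - 1) / 2 items have no moving average and are
    dropped, so the result is shorter than the input.
    """
    if window % 2 == 0:
        raise ValueError("this sketch handles odd windows only")
    values = np.asarray(values, dtype=float)
    half = window // 2
    trend = np.convolve(values, np.ones(window) / window, mode="valid")
    return values[half:len(values) - half] - trend


# Synthetic illustration: a smoothly accelerating series, 0, 1, 4, 9, ...
series = np.arange(11, dtype=float) ** 2
dev = deviations_from_moving_average(series, window=3)
print(dev.mean())   # not zero: the moving average lies above a convex curve
```

Because the mean of such deviations is not zero, the products and squares must be corrected for the two averages before r is computed; a routine such as `np.corrcoef`, which measures from the means of the items supplied, makes the correction implicitly.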

Moving averages are subject to the same criticism as 
are mathematical lines of trend. There can be no certainty 
that deviations from lines of trend based upon moving 
averages represent the effects of cyclical causes solely. The 
result in a given case depends upon the period of the moving 
average employed, and there is no perfect criterion by 
which to determine the best measure of trend. Significant 
and useful coefficients may be computed when deviations 
are measured from moving averages, but the presence of 
an arbitrary element in the work must be recognized and 
the results applied with corresponding reservations. 

The Correlation of Short Term Fluctuations 

In describing the variable factors that constitute compo- 
nent elements of the values of a series in time, it was pointed 
out that the coefficient of correlation would not generally 
be employed in comparing either the secular trends or the 
seasonal fluctuations of two series. It may be used to 
advantage in measuring either functional or temporal rela- 
tions between cyclical fluctuations, pro^dded that the effects 
of the other variables have been, so far as possible, elimi- 
nated. The coefficient of correlation and the measures 
which are employed in conjimction with it have a further 
use in dealing with time series. They may be used to meas- 
ure the relation between short term changes in two series, 
changes from year to year, month to month, or even from 
week to week or day to day, if desired. This problem is 
distinct from that studied in the preceding section and in 
the interpretation of the results the two should not be 
confused. 

There are several ways in which the problem of comparing
short term fluctuations may be attacked. The absolute 
differences between successive items in two series may be 
correlated, or these differences may be expressed as per- 
centages or ratios. Table 101 illustrates the procedure 
employed in measuring the correlation between the absolute 
fluctuations from year to year (first differences) of cotton 
production and cotton prices. The original values from 
which the items in columns (2) and (3) are derived are 
given in Table 95. 
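
The two ways of expressing short term changes mentioned above, absolute differences and percentage changes, may be sketched as follows. The function names and the short series are hypothetical and serve only to show the arithmetic.

```python
def first_differences(series):
    """Absolute change of each item from the item preceding it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def percentage_changes(series):
    """Change of each item from the preceding item, as a percentage of the latter."""
    return [100.0 * (later - earlier) / earlier
            for earlier, later in zip(series, series[1:])]

# Hypothetical annual figures, invented for illustration only.
production = [10.0, 12.5, 10.0, 15.0]

absolute = first_differences(production)
relative = percentage_changes(production)
print(absolute)
print(relative)
```

A series of N items yields N - 1 differences, so two series of 35 annual values give 34 paired first differences.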

The process of computing r is identical with that employed
in preceding examples, when deviations were measured
from an arbitrary origin. The arbitrary origin in
this case is zero, but corrections must be made in the
various values since the algebraic sum of the given figures
is not zero in either case. Computations based on the
figures in Table 101 follow:

$$c_x = \frac{\Sigma X}{N} = \frac{+.933}{34} = +.02744$$

$$c_x^2 = .000753$$

$$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2} = \sqrt{\frac{229.624987}{34} - .000753} = 2.599$$
Table 101

Computation of Coefficient of Correlation between Cotton Production
and Cotton Prices, 1902-1936
(Based upon first differences)

Col. (2), X: Difference between production in given year and production
             in preceding year (in millions of bales)
Col. (3), Y: Difference between price in given year and price in
             preceding year (in cents per pound, deflated)

   (1)         (2)        (3)          (4)           (5)          (6)
Crop year       X          Y           X²            Y²           XY

1902-03      +1.121     + .54       1.256641        .2916     +   .60534
1903-04      - .780     +4.34        .608400      18.8356     -  3.38520
1904-05      +3.587     -5.17      12.866569      26.7289     - 18.54479
1905-06      -2.863     +2.62       8.196769       6.8644     -  7.50106
1906-07      +2.699     -1.25       7.284601       1.5625     -  3.37375
1907-08      -2.167     +1.14       4.695889       1.2996     -  2.47038
1908-09      +2.135     -1.50       4.558225       2.2500     -  3.20250
1909-10      -3.237     +3.79      10.478169      14.3641     - 12.26823
1910-11      +1.604     + .60       2.572816        .3600     +   .96240
1911-12      +4.084     -4.79      16.679056      22.9441     - 19.56236
1912-13      -1.990     +1.44       3.960100       2.0736     -  2.86560
1913-14      + .453     +1.62        .205209       2.6244     +   .73386
1914-15      +1.979     -5.20       3.916441      27.0400     - 10.29080
1915-16      -4.943     +1.73      24.433249       2.9929     -  8.55139
1916-17      + .258     +2.18        .066564       4.7524     +   .56244
1917-18      - .148     +3.03        .021904       9.1809     -   .44844
1918-19      + .739     - .34        .546121        .1156     -   .25126
1919-20      - .620     +2.27        .384400       5.1529     -  1.40740
1920-21      +2.019     -6.02       4.076361      36.2404     - 12.15438
1921-22      -5.486     +3.63      30.096196      13.1769     - 19.91418
1922-23      +1.808     +2.86       3.268864       8.1796     +  5.17088
1923-24      + .378     +4.34        .142884      18.8356     +  1.64052
1924-25      +3.488     -5.70      12.166144      32.4900     - 19.88160
1925-26      +2.476     -2.79       6.130576       7.7841     -  6.90804
1926-27      +1.873     -3.28       3.508129      10.7584     -  6.14344
1927-28      -5.021     +3.42      25.210441      11.6964     - 17.17182
1928-29      +1.522     + .27       2.316484        .0729     +   .41094
1929-30      + .347     - .93        .120409        .8649     -   .32271
1930-31      - .893     -3.13        .797449       9.7969     +  2.79509
1931-32      +3.164     -2.28      10.010896       5.1984     -  7.21392
1932-33      -4.094     +1.39      16.760836       1.9321     -  5.69066
1933-34      + .045     +2.04        .002025       4.1616     +   .09180
1934-35      -3.411     + .75      11.634921        .5625     -  2.55825
1935-36      + .807     -1.35        .651249       1.8225     -  1.08945

Total        + .933     + .27     229.624987     313.0067     -180.19834
$$c_y = \frac{\Sigma Y}{N} = \frac{+.27}{34} = +.00794$$

$$c_y^2 = .000063$$

$$\sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2} = \sqrt{\frac{313.0067}{34} - .000063} = 3.034$$

$$p = \frac{\Sigma(XY)}{N} - c_x c_y = \frac{-180.19834}{34} - (.02744 \times .00794) = -5.300168$$

$$r = \frac{p}{\sigma_x \sigma_y} = \frac{-5.300168}{2.599 \times 3.034} = -.672.$$

The equation of regression and the value of $S_y$, computed
from the usual formulas, are

y = - .78x
$S_y$ = 2.25 cents.

A comparison of the different results secured in the
preceding examples relating to cotton throws some interesting
light upon the general problem of correlation. In
fact, in the two examples, we have measured the correlation
between measurements that are not strictly comparable:
deviations from third degree parabolas, in the first case,
and year-to-year fluctuations in the production and price
of cotton, in the second. Yet, if we were seeking to estimate
the price of cotton which would accompany a given crop,
an estimate might be based upon either of the studies,
the results of which are given below.

                                                            r         S_y

Correlation of cycles in cotton production and
  prices (deviations measured from third degree
  parabolas)                                             - .648    1.94 cents

Correlation of year-to-year fluctuations, same data      - .672    2.25 cents

The value of r in the second example is slightly greater
than the value secured in the first case, though the standard
error is also larger. The reason for this apparent contradiction
has been suggested above; the standard deviation of
the year-to-year fluctuations in cotton prices is greater
than the standard deviation about the trend of cotton prices.
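
The computation for the first differences can be checked from the column totals of Table 101 alone. The sketch below assumes only those totals and N = 34; the variable names are chosen for the illustration, and the slope and standard error are obtained from the usual relations b = r(sigma_y / sigma_x) and S_y = sigma_y * sqrt(1 - r^2).

```python
from math import sqrt

# Column totals taken from Table 101 (first differences, N = 34).
N = 34
sum_x, sum_x2 = 0.933, 229.624987
sum_y, sum_y2 = 0.27, 313.0067
sum_xy = -180.19834

# Corrections for the arbitrary origin (zero): the means of X and Y.
c_x = sum_x / N
c_y = sum_y / N

sigma_x = sqrt(sum_x2 / N - c_x ** 2)
sigma_y = sqrt(sum_y2 / N - c_y ** 2)
p = sum_xy / N - c_x * c_y          # mean product of deviations from the means
r = p / (sigma_x * sigma_y)

slope = r * sigma_y / sigma_x       # coefficient of x in y = bx
s_y = sigma_y * sqrt(1 - r ** 2)    # standard error of estimate, in cents

print(round(r, 3), round(slope, 2), round(s_y, 2))
```

Rounded to the places carried in the text, these agree with r = -.672, y = -.78x and a standard error of about 2.25 cents.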

It appears that errors of estimate are less when based 
upon the results secured when deviations from third degree 
curves are correlated than when based upon the study of 
year-to-year movements. But there is a concealed assump- 
tion in the first case, the assumption that the lines of trend 
of both prices and production may be projected beyond the 
period studied. There is an immeasurable margin of error 
in this assumption, and the standard error of estimate, 
accordingly, does not give a true measure of the probabilities 
involved. No such assumption is involved in the measure 
based upon year-to-year fluctuations. 
CHAPTER XII


THE MEASUREMENT OF RELATIONSHIP: 

NON-LINEAR CORRELATION 

In the preceding chapters the discussion has been confined 
to cases in which the relationship between two variables 
may be described by a straight line. The coefficient of 
correlation, r, is a measure of the degree to which two 
variables approach a linear relationship and it is signifi- 
cant only when a straight line gives a good fit to the points 
representing the paired values of X and Y. 

In fitting curves to time series, as explained in an earlier 
section, it is found that in many cases the trend is non- 
linear, and that a curve of higher degree is needed. The 
same thing is true in the field of our present discussion. 
It is possible to have a high degree of correlation between 
two variables when a straight line does not describe the 
relationship. In such a case there would be considerable 
scatter about the straight line of best fit, and the value 
of r would be misleadingly low. If a curve representing 
the real relationship could be fitted, the scatter would 
be materially reduced and the true correlation could be 
measured. The figures presented in Table 102 illustrate 
such a case. These data are plotted in Fig. 82.

Two different curves have been fitted to the points 
plotted in this figure. One is a straight line having the 
equation 

Y = 5.038 + .0886X 

in which Y represents yield, in tons per acre, and X repre- 
sents depth of irrigation water applied, in inches. The 
degree of relationship between the two variables, as described
by this line, is indicated by the coefficient of
correlation, r, which has a value of + .69.


Table 102

Alfalfa Yield and Irrigation
Summary of investigations at Davis, California ¹

(The measurements in the body of the table measure yields, in tons per acre,
in 44 experiments)

                     Inches of irrigation water applied

              0      12      18      24      30      36      48      60     All

            2.35    4.31    5.69    6.00    7.53    7.58    8.05    5.55
            2.75    4.78    6.46    6.89    7.97    8.22    8.45    7.25
            2.89    4.84    7.02    7.96    8.32    8.63    8.63   10.17
            3.85    5.83    8.02    8.32    9.43    9.33    8.83   10.70
            5.52    6.51            8.38    9.54    9.38    9.52
            5.94    7.52            9.96   11.06   12.48   10.62

Average
yield       3.88    5.63    6.80    7.92    8.98    9.27    9.02    8.42    7.48

An inspection of the figure shows clearly that the straight 
line does not give the best possible fit. It is certain, there- 
fore, that r does not furnish a valid measure of the degree 
of relationship between alfalfa yield and depth of irrigation 
water. 


Parabolic Relationship

The other curve in Fig. 82 is a second degree parabola, 
fitted by the method of least squares. The equation to this 
curve is 

Y = 3.539 + .2527X - .002827X².

It is obvious that the effect of increasing irrigation upon 
alfalfa yield is described much more accurately by this 
latter curve than by the straight line. The most important 
result of these investigations was the determination of the 
point at which alfalfa yield began to fall off with increased 
applications of water, and the straight line fails to indicate
any such decline. 
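
The point at which the fitted parabola turns down can be located directly from its constants, since a curve Y = a + bX + cX² with negative c reaches its maximum at X = -b / (2c). The sketch below uses the constants of the parabola given above; the function and variable names are chosen for the illustration only.

```python
# Constants of the second degree parabola fitted to the alfalfa data.
a, b, c = 3.539, 0.2527, -0.002827

def computed_yield(inches):
    """Yield, in tons per acre, computed from the parabolic equation."""
    return a + b * inches + c * inches ** 2

# Depth of irrigation water at which the computed yield is greatest.
depth_at_peak = -b / (2 * c)
peak_yield = computed_yield(depth_at_peak)

print(round(depth_at_peak, 1), round(peak_yield, 2))
for inches in (0, 12, 24, 36, 48, 60):
    print(inches, round(computed_yield(inches), 2))
```

On this curve the computed yield rises to a maximum at a depth of roughly 45 inches and is lower at 60 inches than at 48, a decline that the straight line cannot show.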

1 This table is taken from “The Economical Irrigation of Alfalfa in the Sacramento
Valley” by S. H. Beckett and R. D. Robertson, Bull. No. 280,
Agricultural Experiment Station, Univ. of California, May, 1917.
As the equation of relationship, therefore, we should use
the parabolic rather than the linear form. The standard
error, $S_y$, which is a necessary accompanying measure, may
be calculated by measuring the deviation of each value
from the corresponding computed value, and determining
the root-mean-square of these deviations. This procedure
is illustrated in Table 103. The figures for normal yield
which are given in this table are computed from the parabolic
equation given above.

Fig. 82. Scatter Diagram Showing the Relation between Alfalfa Yield
and Irrigation Water Applied, with Two Lines of Regression

Inserting the sum of the squared deviations, as given in
col. (5) of Table 103, in the formula

$$S_y = \sqrt{\frac{\Sigma d^2}{N}}$$

we have

$$S_y = \sqrt{\frac{80.9945}{44}} = 1.36.$$
Table 103

Comparison of Actual and Computed Alfalfa Yield

   (1)           (2)            (3)               (4)             (5)
 Depth of       Actual      Normal yield,     Deviation of
irrigation      yield       as computed       actual from
  water                    from parabolic       normal
                              equation         (2) - (3)
    X             Y              Yc                d               d²

    0            3.85           3.54            +  .31            .0961
    0            5.94           3.54            + 2.40           5.7600
    0            5.52           3.54            + 1.98           3.9204
    0            2.75           3.54            -  .79            .6241
    0            2.89           3.54            -  .65            .4225
    0            2.35           3.54            - 1.19           1.4161
   12            4.78           6.16            - 1.38           1.9044
   12            7.52           6.16            + 1.36           1.8496
   12            6.51           6.16            +  .35            .1225
   12            4.31           6.16            - 1.85           3.4225
   12            5.83           6.16            -  .33            .1089
   12            4.84           6.16            - 1.32           1.7424
   18            7.02           7.17            -  .15            .0225
   18            5.69           7.17            - 1.48           2.1904
   18            8.02           7.17            +  .85            .7225
   18            6.46           7.17            -  .71            .5041
   24            6.00           7.98            - 1.98           3.9204
   24            8.38           7.98            +  .40            .1600
   24            8.32           7.98            +  .34            .1156
   24            6.89           7.98            - 1.09           1.1881
   24            9.96           7.98            + 1.98           3.9204
   24            7.96           7.98            -  .02            .0004
   30            7.53           8.58            - 1.05           1.1025
   30            9.54           8.58            +  .96            .9216
   30            9.43           8.58            +  .85            .7225
   30            7.97           8.58            -  .61            .3721
   30           11.06           8.58            + 2.48           6.1504
   30            8.32           8.58            -  .26            .0676
   36            7.58           8.97            - 1.39           1.9321
   36            9.33           8.97            +  .36            .1296
   36            9.38           8.97            +  .41            .1681
   36            8.22           8.97            -  .75            .5625
   36           12.48           8.97            + 3.51          12.3201
   36            8.63           8.97            -  .34            .1156
   48            8.45           9.16            -  .71            .5041
   48            9.52           9.16            +  .36            .1296
   48            8.63           9.16            -  .53            .2809
   48            8.83           9.16            -  .33            .1089
   48           10.62           9.16            + 1.46           2.1316
   48            8.05           9.16            - 1.11           1.2321
   60           10.17           8.52            + 1.65           2.7225
   60            7.25           8.52            - 1.27           1.6129
   60           10.70           8.52            + 2.18           4.7524
   60            5.55           8.52            - 2.97           8.8209

                                                                80.9945
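
The root-mean-square computation of Table 103 may be repeated mechanically. In the sketch below the yields are those of the 44 experiments, grouped by depth of water, and the constants are those of the fitted parabola; the names are chosen for the illustration. Because the computed yields are not rounded to two places here, the sum of squared deviations differs slightly from the figure in the table, though the standard error is the same to two decimals.

```python
from math import sqrt

# Yields (tons per acre) of the 44 experiments, keyed by inches of water applied.
yields = {
    0:  [3.85, 5.94, 5.52, 2.75, 2.89, 2.35],
    12: [4.78, 7.52, 6.51, 4.31, 5.83, 4.84],
    18: [7.02, 5.69, 8.02, 6.46],
    24: [6.00, 8.38, 8.32, 6.89, 9.96, 7.96],
    30: [7.53, 9.54, 9.43, 7.97, 11.06, 8.32],
    36: [7.58, 9.33, 9.38, 8.22, 12.48, 8.63],
    48: [8.45, 9.52, 8.63, 8.83, 10.62, 8.05],
    60: [10.17, 7.25, 10.70, 5.55],
}

# Constants of the fitted second degree parabola.
a, b, c = 3.539, 0.252652, -0.002827

def normal_yield(x):
    """Yield computed from the parabolic equation for x inches of water."""
    return a + b * x + c * x ** 2

n = sum(len(values) for values in yields.values())
sum_sq_dev = sum((y - normal_yield(x)) ** 2
                 for x, values in yields.items() for y in values)
s_y = sqrt(sum_sq_dev / n)   # root-mean-square deviation about the curve

print(n, round(sum_sq_dev, 2), round(s_y, 2))
```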


THE INDEX OF CORRELATION

We need now the third value, the abstract measure of
degree of relationship. In dealing with cases of linear
relationship in the preceding chapter we found that such
a measure, the coefficient of correlation, could be derived
from known values of $S_y$ and $\sigma_y$. An analogous measure
may be derived in the same way in cases of non-linear
relationship, such as that found in the present problem.
Since the term coefficient of correlation and the symbol r
refer only to cases of linear regression, we may term this
general measure the index of correlation, and use the symbol
$\rho$ (rho) to represent it.

As a general formula for the index of correlation we
have¹

$$\rho_{yx}^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

1 With X dependent this formula becomes

$$\rho_{xy}^2 = 1 - \frac{S_x^2}{\sigma_x^2}$$

The first of the two subscripts refers always to the dependent variable, the
second to the independent. It is essential that these be shown, for $\rho$ would
not necessarily be the same with X dependent as with Y dependent. Such a
distinction is not necessary in the case of linear correlation, for r is the same
no matter which variable be dependent.

The value of $S_y$ has been derived above. The value of
$\sigma_y$, computed by familiar methods, is found to be 2.27.
Substituting in the formula for $\rho$, we have

$$\rho_{yx}^2 = 1 - \frac{(1.36)^2}{(2.27)^2}$$

$$\rho_{yx} = .80.$$


This value is materially greater than that of the coefficient 
of correlation for the same data. The value of r is + .69. 
The difference is due to the fact that the second degree 
parabola constitutes a much better fit to the data than 
the straight line. The correlation is distinctly non-linear,
and r is an inappropriate measure of correlation. 
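
The index of correlation for the alfalfa example follows from two quantities: the standard error about the parabola and the standard deviation of the yields. The sketch below assumes the rounded standard error of 1.36 and the totals for the 44 yields (sum 329.03, sum of squares 2,688.2268); the variable names are chosen for the illustration.

```python
from math import sqrt

N = 44
sum_y = 329.03        # total of the 44 yields
sum_y2 = 2688.2268    # total of the squared yields
s_y = 1.36            # standard error about the fitted parabola

# Standard deviation of the yields about their arithmetic mean.
sigma_y = sqrt(sum_y2 / N - (sum_y / N) ** 2)

# Index of correlation: one minus the ratio of the two mean-square scatters.
rho = sqrt(1 - s_y ** 2 / sigma_y ** 2)

r_linear = 0.69       # coefficient of correlation for the straight line
print(round(sigma_y, 2), round(rho, 2), rho > r_linear)
```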

THE MEANING OF THE INDEX OF CORRELATION

It is important that the significance and the limitations
of $\rho$ be understood. Its value depends upon the relation
between the scatter about the fitted line and the scatter
about the arithmetic mean of the Y's. In the case of a
straight line, $\rho$ and r are identical, r being a special case
of $\rho$. The limits of $\rho$ are 0 and 1, a value of 0 indicating
that there is no relationship, or that if there is a relationship
between the two variables it cannot be described by the
particular equation employed. A value of 1 indicates that
the relationship, as described by the equation employed,
is a perfect one. For curves of higher degree no positive or
negative sign should be attached to $\rho$, for the relationship
might be positive over part of the range and negative over
other parts, as in the alfalfa example given above.

The index of correlation, $\rho$, has no significance unless the
type of curve to which it applies be named in each case.
The meaning of r in this respect is always clear, for it is
understood that it relates always to a straight line, but
confusion would arise in the case of $\rho$ unless the type of
curve were specifically mentioned. The index of correlation
may be looked upon as a measure of the adequacy of a
curve of given type to describe the relationship between two
variables.

It is, of course, always possible to secure a curve which
will pass through any number of points if the constants
in the equation be equal to the number of points. In such
a case $\rho$ would, of necessity, be equal to 1, but this value
would have no significance. In any employment of mathematical
functions there is this limit of absurdity, when the
number of constants is equal to the number of points, and
$\rho$ would merely reflect this absurdity. The ordinary principles
of curve fitting must be kept in mind in using such
an index as this. It must never be taken to have an absolute
significance, standing by itself. Its significance is always
relative, referring to the particular function employed.
This fact, which is true of every measure of correlation,
is frequently overlooked, and invalid and fallacious conclusions
reached as a result.
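
The limit of absurdity is easily demonstrated. The sketch below passes a polynomial with five constants through five hypothetical points, using the Lagrange form of the interpolating polynomial (a device chosen here for convenience, not a method of the text), and computes the index of correlation about that curve. The points are invented and bear no relation to one another.

```python
from math import sqrt

def interpolating_curve(xs, ys):
    """Polynomial through every point: as many constants as there are points."""
    def curve(x):
        total = 0.0
        for i, (x_i, y_i) in enumerate(zip(xs, ys)):
            term = y_i
            for j, x_j in enumerate(xs):
                if j != i:
                    term *= (x - x_j) / (x_i - x_j)
            total += term
        return total
    return curve

def index_of_correlation(xs, ys, curve):
    """Square root of one minus (scatter about curve / scatter about mean)."""
    n = len(ys)
    mean_y = sum(ys) / n
    scatter_curve = sum((y - curve(x)) ** 2 for x, y in zip(xs, ys)) / n
    scatter_mean = sum((y - mean_y) ** 2 for y in ys) / n
    return sqrt(max(0.0, 1 - scatter_curve / scatter_mean))

# Five hypothetical, unrelated observations.
xs = [0, 1, 2, 3, 4]
ys = [3.0, -1.0, 4.0, 1.0, 5.0]

rho = index_of_correlation(xs, ys, interpolating_curve(xs, ys))
print(rho)
```

The index is 1 although nothing has been learned about the relation between the two variables.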

A SHORT METHOD OF COMPUTING THE INDEX OF CORRELATION

The standard error and the index of correlation were
computed by a rather laborious method in the above example,
in order that there might be no misunderstanding
of their precise meaning. The burden of calculation may
be materially reduced, however, by taking advantage of
the relationships which were disclosed in dealing with r.
For a curve of the potential series

$$Y = a + bX + cX^2 + dX^3 + \ldots$$

the formula for $S_y$ is derived by a simple extension of that
employed in the case of the straight line. As a general
formula for a series of this type, we have

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y) - \ldots}{N}$$

Similarly, the formula for r may be extended to give a
general formula for $\rho$ applicable to any equation of this
general type. This formula¹ is

$$\rho^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) + d\Sigma(X^3Y) + \ldots - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

In the special case in which the origin is at the mean of
the Y's, $\Sigma(y) = 0$ and $c_y = 0$, and the formula reduces to

$$\rho^2 = \frac{b\Sigma(Xy) + c\Sigma(X^2y) + d\Sigma(X^3y) + \ldots}{\Sigma(y^2)}$$

The characteristics of the formulas for S and $\rho$ should
be noted. The only values required in securing these
measures are the constants in the equation which describes
the average relationship, certain values which have been
used in the process of fitting and, in addition, $\Sigma(Y^2)$ and
$c_y^2$. Thus, as direct by-products of the fitting process, we
have the values of S and $\rho$, the two measures which are
needed to supplement the regression equation in securing
a complete description of the relationship between the two
variables in question. The equation describes the average
relationship. The standard error, S, is a measure of the
reliability of estimates based upon this equation, and $\rho$ is
an abstract index of the degree of relationship, in so far as
that relationship can be described by the particular curve
employed.

The application of these formulas may be illustrated
with reference to the problem of alfalfa yield. The following
values, derived from the data of Table 102 and from the
fitting process, are required for this purpose:

a = 3.539                      $\Sigma(X^2Y)$ = 407,564.64
b = .252652                    $c_y^2$ = 55.9197
c = - .002827                  $\Sigma(Y^2)$ = 2,688.2268
$\Sigma(Y)$ = 329.03                 N = 44.
$\Sigma(XY)$ = 10,271.72


1 See Appendix A for a discussion of the derivation of this formula.

Substituting in the formula for the standard error for a
second degree parabola,

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y)}{N}$$

we have

$$S_y^2 = \frac{2{,}688.2268 - (3.539 \times 329.03) - (.252652 \times 10{,}271.72) - (-.002827 \times 407{,}564.64)}{44}$$

$$= \frac{80.8043}{44} = 1.8365$$

$$S_y = 1.36.$$

The index of correlation, for a curve of this type, is
computed from the equation

$$\rho_{yx}^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

Substituting the appropriate values, we have

$$\rho_{yx}^2 = \frac{146.9557}{2{,}688.2268 - (44 \times 55.9197)} = .6452$$

$$\rho_{yx} = .80.$$
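
The short method lends itself to a brief check. The sketch below assumes only the constants of the parabola and the totals listed for the alfalfa problem; the variable names are chosen for the illustration. The mean of the Y's is squared without intermediate rounding, so the last decimal may differ slightly from the printed figures.

```python
from math import sqrt

# Constants of the fitted parabola and totals from the fitting process.
a, b, c = 3.539, 0.252652, -0.002827
N = 44
sum_y = 329.03
sum_xy = 10271.72
sum_x2y = 407564.64
sum_y2 = 2688.2268

# Part of the sum of squares of Y accounted for by the equation.
accounted = a * sum_y + b * sum_xy + c * sum_x2y

s_y_squared = (sum_y2 - accounted) / N
s_y = sqrt(s_y_squared)

c_y_squared = (sum_y / N) ** 2            # square of the mean of the Y's
rho_squared = (accounted - N * c_y_squared) / (sum_y2 - N * c_y_squared)
rho = sqrt(rho_squared)

print(round(s_y_squared, 4), round(s_y, 2))
print(round(rho_squared, 4), round(rho, 2))
```

Because the formula takes a small difference between large totals, the constants must be carried to several more decimal places than are wanted in the result.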

The value of the index of correlation is influenced by
the relation between the number of observations and the
number of constants in the equation of relationship. When
the two are equal $\rho$ will have a value of 1. In any case the
observed index of correlation tends to exceed the true index.
When the number of observations is not large it is advisable
to apply a correction for this bias. If we use $\bar{\rho}$ to represent
the corrected value and m to represent the number of
constants in the equation of relationship, we may apply
a correction in terms of the relation¹

$$\bar{\rho}^2 = 1 - \left\{ (1 - \rho^2) \, \frac{N - 1}{N - m} \right\}.$$

Inserting the values given in the above example, we have

$$\bar{\rho}_{yx}^2 = 1 - \left\{ (1 - .6452) \left( \frac{43}{41} \right) \right\} = .6279$$

$$\bar{\rho}_{yx} = .79.$$

1 From Mordecai Ezekiel, Methods of Correlation Analysis, New York,
1930, 121.

If, in the application of this test, the value in brackets { }
exceeds unity, the value of $\bar{\rho}$ is taken as 0.*
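
The correction may be written as a small function. The sketch below applies the relation quoted above, with N observations and m constants, and takes the corrected value as zero when the bracketed quantity exceeds unity; the function name and the second, hypothetical case are introduced only for illustration.

```python
from math import sqrt

def corrected_index_squared(rho_squared, n, m):
    """Square of the index of correlation, corrected for n observations
    and m constants in the equation of relationship."""
    bracket = (1 - rho_squared) * (n - 1) / (n - m)
    if bracket > 1:
        return 0.0          # corrected index is taken as zero
    return 1 - bracket

# Alfalfa example: 44 observations, second degree parabola (3 constants).
alfalfa = corrected_index_squared(0.6452, 44, 3)
print(round(alfalfa, 4), round(sqrt(alfalfa), 2))

# Hypothetical case: a very low index from only 10 observations.
weak = corrected_index_squared(0.02, 10, 3)
print(weak)
```

With 44 observations the correction is small, lowering the index only from .80 to .79; with few observations and many constants it becomes severe.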

These methods of deriving S and $\rho$ are applicable over
a wide field by a simple adaptation of the formulas to the
particular equations that may be employed in given
instances. Further illustrations are given in Chapter XVII,
while this general method is explained in more detail in
Appendix A.

The Correlation Ratio

A third distinctive measure of correlation remains to
be described. This is the correlation ratio, devised by
Karl Pearson and represented by the symbol $\eta$ (eta).
This measure may be looked upon as a special case of $\rho$,
but somewhat different methods are employed in its computation.

We have seen that in all cases the degree of relationship
between two variables, as described by a curve of a given
type, may be determined from the formula

$$\text{Measure of correlation} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

The coefficient of correlation, r, is just such a measure,
when $S_y$ represents the standard deviation about a straight
line. The index of correlation, $\rho$, is a general measure of
the same type. The correlation ratio is precisely the same
sort of measure, $S_y$ in this case representing the standard
deviation about a line passing through the mean of every
column in the correlation table. We have, in effect, increased
the number of constants in the equation of the curve to
be fitted until the number is equal to the number of columns.
If the means of all the columns lie on a straight line, the
correlation ratio and the coefficient of correlation will be
equal. If the means of the columns do not lie on a straight
line, the correlation ratio will be greater than the coefficient
of correlation.

* A corresponding correction should be made in the standard error of estimate,
when derived from a small number of observations. In this case the
correction must raise the unadjusted measure. For this correction Ezekiel
gives

$$\bar{S}^2 = S^2 \left( \frac{N}{N - m} \right)$$

where $\bar{S}$ represents the corrected standard error of estimate.

No new principle is involved, therefore, in the concept of
the correlation ratio. It is employed when the regression
is non-linear. It measures the degree of relationship between
two variables, in so far as this relationship may be
described by a curve passing through the mean of every
column. If the relationship is perfect, if there is no scatter
about the curve fitted in this way, $\eta$ will have a value of 1.
If there is no relationship, if the scatter about the curve
is as great as the dispersion about the mean of the Y's, $\eta$
will have a value of zero.

The formula generally employed in the computation of
the correlation ratio differs somewhat from that given above.
To represent the standard deviation about the line joining
the means of the columns, the symbol $\sigma_{ay}$ is employed,
instead of $S_y$. Its meaning is precisely the same as that
of $S_y$, as employed above, except that $\sigma_{ay}$ refers always
to a correlation table.

The formula may be written

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}.$$

When eta is written as above ($\eta_{yx}$) it refers to the regression
of Y on X (Y dependent). When it is written $\eta_{xy}$ it
refers to the regression of X on Y (X dependent), and its
value depends upon the scatter about a line joining the
means of the rows. Unlike r, which has the same value
for both regressions, $\eta_{yx}$ and $\eta_{xy}$ will have different values
unless the regression be linear.





THE COMPUTATION OF THE CORRELATION RATIO

Table 104 shows the general relation between the amount
of nitrogen, in pounds per acre, used as fertilizer in certain
agricultural experiments, and the corresponding yield of
wheat, in bushels per acre.¹ The points are plotted in
Fig. 83.

Table 104

Correlation Table Showing the Relation between Wheat Yield per Acre
and Amount of Nitrogen Used as Fertilizer

Fig. 83. Scatter Diagram Showing the Relation between Wheat Yield
and Nitrogen Applied as Fertilizer, with Straight Line of Regression and
Line Joining the Means of the Columns

For the computation of $\eta_{yx}$ by the formula given above
we need the values of $\sigma_y$ and $\sigma_{ay}$, the latter being the
root-mean-square deviation about the line joining the means
of the various columns. The former value may be obtained
readily by methods already familiar. It is possible to
compute the quantity $\sigma_{ay}$ by the method first employed
in calculating $S_y$, that is, by measuring and squaring the
deviations of the individual points from the line of regression.
In the present case, however, the line describing
the relationship passes through the mean of each column,
hence these means may be used in place of the “normal”
values as computed from an equation of regression. In
computing $\sigma_{ay}$, therefore, the deviations of the individual
items from the means of the various columns are squared,
added, the mean determined and the square root extracted,
just as in the computation of the standard deviation.
Part of the procedure is illustrated in Table 105, using the
data in the first column of Table 104. This column contains
all items having X-values between 0 and 20. The mean
Y-value of the 21 items falling in this column is 5.05;
deviations are measured from this value.

1 This table is based upon experiments described by E. Davenport (“Comparative
Agriculture” in Bailey's Cyclopedia of American Agriculture). The
actual figures used have been arbitrarily chosen for the purpose of the present
illustration, but Davenport's experiments have demonstrated the existence
of a law similar to the one here assumed.


Table 105

Computation of the Squares of the Deviations about the Mean of an Array

Class-interval        Mid-      Fre-      Deviation from
(wheat yield in       point    quency     mean of column
bu. per acre)                                 (5.05)
                        m         f              d              d²           fd²

   8-11.9              10         3           + 4.95         24.5025       73.5075
   4- 7.9               6        10           +  .95           .9025        9.0250
   0- 3.9               2         8           - 3.05          9.3025       74.4200

   Total                         21                                       156.9525


The sum of the squared deviations is obtained for each
of the other columns in a similar fashion. The standard
deviation about the means of all the columns, $\sigma_{ay}$, is found
to have a value of 2.420. The value of $\sigma_y$ is 9.188.

Substituting the given values in the formula

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

we have

$$\eta_{yx}^2 = 1 - \frac{(2.42)^2}{(9.188)^2} = 1 - .0694 = .9306$$

$$\eta_{yx} = .965.$$

This is the value of the correlation ratio, measuring the
degree of scatter about a line running through the means
of the columns. Its significance is discussed below.
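
The work for a single column, and the final substitution, may be sketched as follows. The mid-points and frequencies are those of the first column of the wheat table (yields of 0 to 3.9, 4 to 7.9 and 8 to 11.9 bushels); the two standard deviations, 2.420 and 9.188, are taken as given, since the remaining columns are not reproduced here. The variable names are chosen for the illustration.

```python
from math import sqrt

# First column of the correlation table: (mid-point of class, frequency).
first_column = [(10, 3), (6, 10), (2, 8)]

n_column = sum(f for _, f in first_column)
column_mean = sum(m * f for m, f in first_column) / n_column

# Sum of squared deviations of the items about the mean of their own column.
sum_fd2 = sum(f * (m - column_mean) ** 2 for m, f in first_column)
print(n_column, round(column_mean, 2), round(sum_fd2, 2))

# Final substitution, with the values stated for all columns together.
sigma_ay = 2.420   # standard deviation about the means of the columns
sigma_y = 9.188    # standard deviation of all the Y's
eta = sqrt(1 - sigma_ay ** 2 / sigma_y ** 2)
print(round(eta, 3))
```

The sum for the column differs from the tabled 156.9525 only in the fourth decimal, because the column mean is not rounded to 5.05 here.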

The method of calculation employed in the preceding
example may be materially shortened. Let $\sigma_{my}$ represent
the standard deviation of the means of the various columns
about the arithmetic mean of all the Y's. In computing
this value the mean of each column is weighted by the
number of items in that column. It may be shown¹ that

$$\sigma_{ay}^2 = \sigma_y^2 - \sigma_{my}^2.$$

Substituting for $\sigma_{ay}^2$ in the equation

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

we secure

$$\eta_{yx}^2 = \frac{\sigma_{my}^2}{\sigma_y^2}, \qquad \eta_{yx} = \frac{\sigma_{my}}{\sigma_y}.$$

Since $\sigma_{my}$ may be much more easily determined than $\sigma_{ay}$
the value of $\eta$ is generally computed from this formula.
The data of Table 104 may be used to exemplify the process.
Calculations appear in Table 106.

1 The following proof is adapted from Yule.

Given a series with mean $M$ made up of two component series with means $M_1$
and $M_2$. $N$, the total number of observations, is equal to $N_1 + N_2$, the sum of
the observations in the two component series. What is the relation between $\sigma$,
$\sigma_1$ and $\sigma_2$? If we let $M_1 - M = c_1$ and $M_2 - M = c_2$,
then for $S_1^2$, the mean-square deviation of the observations in the first of the
two component series, measured from $M$ as origin, we have

$$S_1^2 = \sigma_1^2 + c_1^2.$$

Similarly

$$S_2^2 = \sigma_2^2 + c_2^2.$$

But $N_1 S_1^2$ is equal to the sum of the squares of the deviations, about $M$, of
the items in the first of the component series, and $N_2 S_2^2$ is equal to the sum
of the squares of the deviations, about $M$, of the items in the second of the
two component series. Therefore

$$\sigma^2 = \frac{N_1 S_1^2 + N_2 S_2^2}{N}$$

and

$$N\sigma^2 = N_1 S_1^2 + N_2 S_2^2. \qquad (1)$$

But $S_1^2 = \sigma_1^2 + c_1^2$ and $S_2^2 = \sigma_2^2 + c_2^2$; therefore

$$N\sigma^2 = N_1(\sigma_1^2 + c_1^2) + N_2(\sigma_2^2 + c_2^2). \qquad (2)$$

In the present case we have the major series with mean represented by $M_y$,
and a number of component series (the items arranged by columns) with means
represented by $m_{y_1}$, etc. Let $S_{ay}$ represent the standard deviation of any column
of Y's about the mean of that column. Then we have a number of component
series, with standard deviations $S_{ay_1}$, etc., and with means differing from the
mean of all the Y's by $M_y - m_{y_1}$, etc. Substituting in equation (2), we have

$$N\sigma_y^2 = n_1[S_{ay_1}^2 + (M_y - m_{y_1})^2] + n_2[S_{ay_2}^2 + (M_y - m_{y_2})^2] + \ldots \qquad (3)$$

$$N\sigma_y^2 = \Sigma\, n[S_{ay}^2 + (M_y - m_y)^2]. \qquad (4)$$

But $N\sigma_{ay}^2 = \Sigma(n S_{ay}^2)$, for, in each column,

$$S_{ay}^2 = \frac{\Sigma d^2}{n}$$

since d represents a deviation from the mean of that column. For all columns,

$$\sigma_{ay}^2 = \frac{\Sigma(\Sigma d^2)}{N} = \frac{\Sigma(n S_{ay}^2)}{N}.$$

Substituting in equation (4)

$$N\sigma_y^2 = N\sigma_{ay}^2 + \Sigma\, n(M_y - m_y)^2. \qquad (5)$$

By definition of the standard deviation of the means of the columns

$$\sigma_{my}^2 = \frac{\Sigma\, n(M_y - m_y)^2}{N}.$$

Therefore, from (5),

$$\sigma_y^2 = \sigma_{ay}^2 + \sigma_{my}^2. \qquad (6)$$
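
The relation between the total scatter, the scatter within the columns and the scatter of the column means can be verified numerically. The sketch below uses three small hypothetical columns, invented for the purpose, and confirms that the two formulas for the correlation ratio give the same value.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def mean_square_deviation(values, origin):
    """Mean of the squared deviations of the values about a given origin."""
    return sum((v - origin) ** 2 for v in values) / len(values)

# Hypothetical columns of Y-values, keyed by the X-value of the column.
columns = {
    10: [4.0, 6.0, 5.0],
    30: [14.0, 16.0],
    50: [22.0, 26.0, 24.0, 28.0],
}

all_y = [y for ys in columns.values() for y in ys]
n_total = len(all_y)
grand_mean = mean(all_y)

# Square of the standard deviation of all the Y's.
total_var = mean_square_deviation(all_y, grand_mean)

# Scatter about the column means, each column weighted by its frequency.
within_var = sum(len(ys) * mean_square_deviation(ys, mean(ys))
                 for ys in columns.values()) / n_total

# Scatter of the column means about the mean of all the Y's, weighted likewise.
between_var = sum(len(ys) * (mean(ys) - grand_mean) ** 2
                  for ys in columns.values()) / n_total

eta_from_within = sqrt(1 - within_var / total_var)
eta_from_between = sqrt(between_var / total_var)
print(round(total_var, 4), round(within_var + between_var, 4))
print(round(eta_from_within, 4), round(eta_from_between, 4))
```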


Table 106

Illustrating the Computation of the Correlation Ratio

Type of array      Mean value      Deviation       Square of     Fre-
(X-value of        of Y-items      from mean       deviation     quency
items in           in column       of all Y's
column)            (bushels)       (25.005)
(pounds)
                      m_y              d               d²           f           fd²

    10                5.05         - 19.955         398.202        21        8,362.242
    30               15.12         -  9.885          97.713        25        2,442.825
    50               24.40         -   .605            .366        30           10.980
    70               28.73         +  3.725          13.876        44          610.544
    90               31.73         +  6.725          45.226        37        1,673.362
   110               32.40         +  7.395          54.686        20        1,093.720
   130               32.00         +  6.995          48.930         8          391.440
   150               33.33         +  8.325          69.306         6          415.836
   170               34.00         +  8.995          80.910         2          161.820

   Total                                                          193       15,162.769

$$\sigma_{my} = \sqrt{\frac{15{,}162.769}{193}} = 8.864.$$
Substituting the given values in the formula

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

we have

$$\eta_{yx} = \frac{8.864}{9.188} = .965.$$

The process of computing the correlation ratio may be
briefly summarized:

1. Arrange the items in the form of a correlation table.

2. Find the arithmetic mean of all the Y-items in each column
   (i.e., find the arithmetic mean of each Y-array of type X).

3. Compute the arithmetic mean of all the Y's.

4. Measure the deviation of the mean of each column from the
   mean of all the Y's. Square each of these deviations and
   multiply by the number of items in the given column. Get the
   sum of the squared deviations.

5. Divide this sum by the total number of items and extract the
   square root of the result. This gives the value of $\sigma_{my}$.

6. Compute $\sigma_y$.

7. Divide $\sigma_{my}$ by $\sigma_y$. The quotient is $\eta_{yx}$.
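
Steps 3 to 7 above can be sketched with the column means and frequencies of the wheat example. The value of the standard deviation of all the Y's, 9.188, is taken as given, since it cannot be recovered from the column means alone. The general mean is computed from the rounded column means, so the result may differ from the tabled figures in the last decimal.

```python
from math import sqrt

# (mean of Y-items in column, number of items in column) for each column.
column_means = [
    (5.05, 21), (15.12, 25), (24.40, 30), (28.73, 44), (31.73, 37),
    (32.40, 20), (32.00, 8), (33.33, 6), (34.00, 2),
]
sigma_y = 9.188   # standard deviation of all the Y's, as stated in the text

# Step 3: arithmetic mean of all the Y's, from the weighted column means.
n_total = sum(f for _, f in column_means)
mean_all = sum(m * f for m, f in column_means) / n_total

# Step 4: squared deviations of column means, weighted by column frequency.
weighted_squares = sum(f * (m - mean_all) ** 2 for m, f in column_means)

# Step 5: standard deviation of the means of the columns.
sigma_my = sqrt(weighted_squares / n_total)

# Step 7: the correlation ratio of Y on X.
eta_yx = sigma_my / sigma_y

print(n_total, round(mean_all, 3), round(sigma_my, 3), round(eta_yx, 3))
```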

The value of the correlation ratio of X on Y may be
similarly computed, substituting the proper values in the
formula

$$\eta_{xy} = \frac{\sigma_{mx}}{\sigma_x}.$$

The symbol $\sigma_{mx}$ represents the standard deviation of the
means of the various rows about the mean of all the X's.
The value of the correlation ratio of X on Y depends upon
the amount of scatter (horizontally) about the line joining
the means of the rows. Its value will generally be different
from that of the correlation ratio of Y on X. In the present
case the value of $\eta_{xy}$ is found to be .824. As the line of
relationship approaches the linear form the two correlation
ratios approach identity.
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Like r, ri can never exceed 1, this value being secured 
when there is no dispersion about the line joining the 
means of the columns (or rows). From the formula 



it is evident that the value of the correlation ratio is zero 
when (Tmv is zero. This is the case when the mean of each 
column has the same value as the mean of all the F’s. 
Such a condition is foimd when an increase or decrease in 
the value of the X-variable brings no corresponding change 
in the value of the F-variable. This means that in each 
column of the correlation table there is a distribution of 
cases sinnlaj: to the general distribution of F’s. When 
this is true there is clearly no relation between the two 
variables. 

The correlation ratio, it should be noted, never has a
negative value. It is possible to determine by inspection
of the correlation table, however, whether the relation
between two variables is direct, or inverse, or a varying one.

The coefficient of correlation has one distinct advantage,
as compared with the correlation ratio, in that when its
value and the values of the two standard deviations are
known the equations to the lines of regression may be
readily determined. This is not true of $\eta$. To get a quantitative
expression for the “law” of relationship between two
variables, when $\eta$ has been computed, an additional calculation
for the purpose of fitting a curve to the means of the
arrays would be necessary.

CORRECTION OF THE CORRELATION RATIO

The use of $\eta$ is only possible when the data are numerous,
and can be arranged in the form of a correlation table. If
a limited number of items should be so arranged, and it
chanced that there was but one item in each column, the
two measures $\sigma_{m_y}$ and $\sigma_y$ would be identical and $\eta$ would
necessarily have a value of 1. Computed from a very small
number of cases and employing a large number of classes,
the correlation ratio would be meaningless.

The raw correlation ratio may be corrected by the method
employed on a preceding page for the index of correlation,
with m set equal to the number of groups (i.e., to the number
of columns for $\eta_{yx}$, or to the number of rows for $\eta_{xy}$).
Thus, if $\bar{\eta}$ be the corrected value, we have

$$\bar{\eta}^2 = 1 - \frac{N - 1}{N - m}\,(1 - \eta^2).$$

In the present instance

$$\bar{\eta}^2 = 1 - \frac{193 - 1}{193 - 9}\,(1 - \eta^2) = .9276$$

$$\bar{\eta} = .963.$$

The correction is very slight in the present case, but if
N were small or m very large it would reduce the given
value materially.
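
The effect of N and m on the correction may be seen from a short sketch. It assumes the correction has the form $\bar{\eta}^2 = 1 - \frac{N-1}{N-m}(1 - \eta^2)$, with N the number of items and m the number of groups; the function name `corrected_eta` and the clamping of a negative result to zero are our own choices for illustration.

```python
import math


def corrected_eta(eta, n_items, n_groups):
    """Correct a raw correlation ratio for the number of groups employed.

    Assumes eta_bar^2 = 1 - (N - 1) / (N - m) * (1 - eta^2).
    """
    if n_items <= n_groups:
        raise ValueError("the number of items must exceed the number of groups")
    eta_bar_sq = 1 - (n_items - 1) / (n_items - n_groups) * (1 - eta ** 2)
    # Clamp at zero so the square root stays real (a choice of this sketch).
    return math.sqrt(max(eta_bar_sq, 0.0))


raw_eta = 8.864 / 9.188                   # uncorrected ratio from Table 106
print(round(corrected_eta(raw_eta, 193, 9), 3))

# The same raw value from only 12 items in 9 columns is reduced far more.
print(round(corrected_eta(0.965, 12, 9), 3))
```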


RELATION BETWEEN THE CORRELATION RATIO AND THE
COEFFICIENT OF CORRELATION

When the relation between two variables is absolutely
linear the line running through the means of the columns
corresponds, of course, to the line upon which the coefficient
of correlation is based. When this is the case $\eta$ and r have
the same value. As the relationship between the two
variables departs from the linear form the values secured
for $\eta$ and r differ, $\eta$ being always greater than r. This
results from the fact that the scatter about the line joining
the means of the columns will always be less than the
scatter about a straight line fitted to these points, except
when the straight line passes through every mean point.
And the less the scatter about the line expressing the
average relationship the greater the value of the measure
of correlation. Thus for the alfalfa problem it was found
that r has a value of + .69, and that an index of correlation
based upon a second degree parabola has a value of .80.
The correlation ratio for the same material is .82. For
the data of Table 104 the value of $\eta_{yx}$ (uncorrected) was
found to be .965; the value of r is + .793, the difference
between the two being marked. The reason for the difference
is found in Fig. 83, in which the straight line of regression
of Y on X and the line joining the means of the columns
are shown. The regression departs materially from linearity,
and the scatter about the straight line of regression is much
greater than the scatter about the line joining the means.

The relation between r and $\eta$ affords a convenient test
of linearity in a given instance, since the two values will
be identical when the regression is strictly linear, and will
differ the more widely the greater the departure from the
linear form. The general test for linearity is

$$\zeta = \eta^2 - r^2.$$

Even in a case of linear regression it is probable that $\eta$
and r will differ somewhat because of fluctuations due to
chance alone. A material difference, as reflected in the
magnitude of $\zeta$ (zeta), indicates that a straight line does not
describe the relationship in question and that r is not a
suitable measure of correlation. In the example given
above, in which $\eta$ equals .965 and r equals .793, the
measure $\zeta$ has a value of .302. (The uncorrected $\eta$ is used
in this test.) This is large enough to indicate that the regression
is non-linear.
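
The arithmetic of the test may be checked with the following sketch. The function name `linearity_zeta` is ours; the text gives no numerical threshold for a "material" difference, so none is assumed here.

```python
def linearity_zeta(eta, r):
    """Return zeta = eta^2 - r^2, the difference used in the linearity test."""
    return eta ** 2 - r ** 2


# Values quoted in the text for the data of Table 104.
zeta = linearity_zeta(0.965, 0.793)
print(round(zeta, 3))
```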

In later sections methods of testing for linearity are
more fully discussed.

CHAPTER XIII


ELEMENTARY PROBABILITIES AND THE 
NORMAL CURVE OF ERROR 

Reference has been made in an earlier section to the
family resemblance which is found among frequency distributions
drawn from widely different fields. Attention was
also drawn to a certain basic type, represented graphically
by the symmetrical bell-shaped curve, which is called
the “normal curve,” or the “normal curve of error.” In
an earlier day this curve was looked upon as representing
a fundamental law which described all distributions of
quantitative data. From the modern standpoint this was
quite an erroneous conception. The normal curve is viewed
today as but one of a number of types of curves which
may be used to describe frequency distributions. It is,
however, by far the most important type. For many of
the measurements used to describe distributions of observations
(measurements such as the mean, the standard deviation,
the coefficient of variation) are distributed in accordance
with this normal law of error. The procedures employed
in generalizing results obtained from the study of samples
and, in particular, in determining the reliability of such
generalizations, lean heavily upon this law. An understanding
of the characteristics of the normal curve is
essential to the statistician.

Elementary Theorems in Probability 

We may approach this subject by a brief consideration
of certain elementary principles of probability that enter
into many forms of statistical work. A detailed explanation
of the theory of probability would carry us beyond the
limits of the present volume. The treatment which follows
is presented only as an introduction to the subject, designed
to illustrate, by simple numerical examples, the relation
between the principles of probability and the normal law
of error.

In this argument we may use the following standard
notation. If an event can occur in n equally likely ways, a of
which are to be considered as successful and b as unsuccessful,
the probability p of a successful outcome may be written

$$p = \frac{a}{n}$$

and the probability q of an unsuccessful outcome may be
written

$$q = \frac{b}{n}.$$

Since the sum of the favorable and unfavorable outcomes
is equal to the total number of events, we have

$$a + b = n.$$

Dividing by n,

$$\frac{a}{n} + \frac{b}{n} = 1$$

so that

$$p + q = 1$$

or certainty.

A probability, therefore, may be written as a ratio. The
numerator of the fraction corresponding to this ratio represents
the number of favorable (or unfavorable) outcomes,
while the denominator represents the total number of
possible outcomes.


EXAMPLES OF SIMPLE PROBABILITIES


If a coin be tossed, the turning up of a head being looked
upon as a favorable outcome, we have, as the probability
of a success,

$$p = \frac{1}{2}$$

and of a failure,

$$q = \frac{1}{2}.$$

If we roll a die, regarding a six spot as a favorable outcome,

$$p = \frac{1}{6}$$

and

$$q = \frac{5}{6}.$$

If a card be drawn from a pack of 52 the chance of drawing
the ace of spades is $\frac{1}{52}$; of failing in that endeavor, $\frac{51}{52}$.


THE ADDITION OF PROBABILITIES

What is the chance of securing either an ace of spades
or a two of spades in a single draw from a pack of 52 cards?
In such a case, where any one of several mutually exclusive
outcomes will be considered as favorable, the probability of
a success is the sum of the separate probabilities. In this
example

$$p = \frac{1}{52} + \frac{1}{52} = \frac{1}{26}.$$

The chance of drawing either a heart or a spade from a
pack of playing cards is given by

$$p = \frac{13}{52} + \frac{13}{52} = \frac{1}{2}.$$


THE MULTIPLICATION OF PROBABILITIES

Two events are said to be independent when the outcome
of one does not affect the outcome of the other. Thus the
result of one throw of a die does not, presumably, affect
the result of the next toss. The probability of a compound
event (i.e., that two events, independent of one another,
will both occur) is the product of the probabilities of the
separate events. Thus the chance of securing an ace,
followed by a two spot, in two successive throws of a die,
is given by

$$p = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}.$$

In computing the probability of a given outcome it is
frequently necessary both to multiply and to add probabilities.
For example, we wish to determine the chance of
securing the total 5 from two dice thrown simultaneously.
We may label the dice a and b to distinguish them. This
total may be secured from any one of the four following
combinations:

    Die a     Die b

      1         4
      2         3
      3         2
      4         1

The chance of securing an ace with die a is $\frac{1}{6}$, of securing
a 4 with die b is $\frac{1}{6}$. The chance of the two in combination
is $\frac{1}{36}$. Similarly, the probability of each of the other
three combinations is $\frac{1}{36}$. But any one of these four results
will give a total of 5, and will be considered successful.
Hence

$$p = \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{4}{36} = \frac{1}{9}.$$


We have in this example answered the question: What
is the probability of securing exactly 5 in the toss of two
dice? We might put the question: What is the chance of
securing at least 5 in the toss of two dice? In this case a
total of 5 or more will be considered a favorable outcome.
Just as in the preceding example, we may work out the
probability of securing each of the results which will be
accepted as successful. The following summary indicates
the probability of each of these totals:

    Probability of throwing 12 with two dice      1/36
         "       "     "    11   "    "   "       2/36
         "       "     "    10   "    "   "       3/36
         "       "     "     9   "    "   "       4/36
         "       "     "     8   "    "   "       5/36
         "       "     "     7   "    "   "       6/36
         "       "     "     6   "    "   "       5/36
         "       "     "     5   "    "   "       4/36

    Sum of above probabilities                   30/36

The chance of throwing at least 5 in the toss of two dice is,
therefore, $\frac{30}{36}$ or $\frac{5}{6}$.
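
Both results may be verified by direct enumeration of the 36 equally likely outcomes. The following is a minimal sketch; the helper name `probability` is ours, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die a, die b) outcomes.
outcomes = list(product(range(1, 7), repeat=2))


def probability(event):
    """Probability that event(a, b) holds, by counting favorable outcomes."""
    favorable = sum(1 for a, b in outcomes if event(a, b))
    return Fraction(favorable, len(outcomes))


p_exactly_5 = probability(lambda a, b: a + b == 5)
p_at_least_5 = probability(lambda a, b: a + b >= 5)
print(p_exactly_5, p_at_least_5)
```

Counting favorable outcomes in this way combines the two rules: each single combination has probability 1/36 (multiplication), and the favorable combinations are summed (addition).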

The Binomial Expansion and the Measurement of
Probabilities

It is possible to express these facts in a generalized form.
A simple illustration may be employed to exemplify the
derivation of the general expression.

If two coins are tossed simultaneously there are four
possible outcomes

    a b    a b    a b    a b
    T T    T H    H T    H H.

(The two coins are represented, respectively, by the letters
a and b.) The chances of securing no heads, one head, and
two heads are, respectively, $\frac{1}{4}$, $\frac{1}{2}$, and $\frac{1}{4}$. If three coins
(represented by the letters a, b, and c) are tossed simultaneously,
we have eight possible outcomes

    a b c   a b c   a b c   a b c   a b c   a b c   a b c   a b c
    T T T   T T H   T H H   T H T   H T T   H T H   H H T   H H H.

The chances of securing no heads, 1 head, 2 heads, and
3 heads are, respectively, $\frac{1}{8}$, $\frac{3}{8}$, $\frac{3}{8}$, $\frac{1}{8}$.

But these results may be derived without working out
the separate probabilities in detail. We have employed
p and q to represent, respectively, the probability of success
and failure of a given event. If there are two independent
events the compound probabilities are given by the expansion
of the expression

$$(p + q)^2.$$

For the case in which p (e.g., the probability of throwing
a head) $= q = \frac{1}{2}$, the probabilities of the various results
are given by

$$\left(\frac{1}{2} + \frac{1}{2}\right)^2 = \frac{1}{4} + \frac{2}{4} + \frac{1}{4}.$$

These are the results secured in the first example cited
in this section. If there are three independent events,
with $p = q = \frac{1}{2}$, we have

$$\left(\frac{1}{2} + \frac{1}{2}\right)^3 = \frac{1}{8} + \frac{3}{8} + \frac{3}{8} + \frac{1}{8},$$

the probabilities secured in the second example.

If we wish to know not the separate probabilities but the
probable frequencies of the various outcomes in a given
number of trials, these may be computed from the expression

$$N(p + q)^n$$

where N represents the number of trials and n the number
of independent events. Thus if there are 200 trials and
there are two independent events, the probable frequencies
are given by

$$200(p + q)^2 = 200(p^2 + 2pq + q^2).$$

With $p = q = \frac{1}{2}$ this gives us

$$200\left(\frac{1}{4}\right) + 200\left(\frac{2}{4}\right) + 200\left(\frac{1}{4}\right) = 50 + 100 + 50$$

which indicates the probable frequencies of 2 successes,
1 success, and no successes.

If there are three independent events, the probable
frequencies in N trials are determined from the binomial
expansion of

$$N(p + q)^3.$$

If N equals 200, we have

$$200(p^3 + 3p^2q + 3pq^2 + q^3).$$

If p equals $\frac{1}{2}$, we have

$$200\left(\frac{1}{8}\right) + 200\left(\frac{3}{8}\right) + 200\left(\frac{3}{8}\right) + 200\left(\frac{1}{8}\right).$$

These terms indicate, in order, the probable frequencies
of 3 successes, 2 successes, 1 success, and no successes.
The total frequencies secured by carrying through the
process of multiplication will be equal to the number of
trials, for all possible outcomes are covered by the expansion.
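
The terms of $N(p + q)^n$ may be generated for any N, n, and p, as in the sketch below. The function name `probable_frequencies` is ours; exact fractions are used so that the terms can be compared with the hand computation, and the order of the terms (from n successes down to none) follows the text.

```python
from fractions import Fraction
from math import comb


def probable_frequencies(n_trials, n_events, p):
    """Terms of N(p + q)^n, ordered from n successes down to none."""
    p = Fraction(p)
    q = 1 - p
    return [
        n_trials * comb(n_events, k) * p ** k * q ** (n_events - k)
        for k in range(n_events, -1, -1)
    ]


half = Fraction(1, 2)
print([int(f) for f in probable_frequencies(200, 2, half)])
print([int(f) for f in probable_frequencies(200, 3, half)])

# p and q need not be equal; the terms still sum to the number of trials.
print(sum(probable_frequencies(200, 3, Fraction(1, 6))))
```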

Thus, when we know in advance¹ the probabilities attaching
to similar but independent events, we may determine the
probable frequencies of any given number of successes or
failures. This is true whether p and q be equal or unequal.
It is necessary only that p and q remain constant. There
is here a fact of great significance in the development of
statistical theory.

¹ A distinction is generally drawn between a priori probabilities of the type
described above, and empirical probabilities, knowledge of which is derived
from observation or experience. As an example of the latter type we have,
as the probability that a man aged 35 will live 10 years, the ratio
$\frac{74{,}173}{81{,}822}$. This is based upon the American Experience Table of
Mortality which shows that of 81,822 men living at the age of 35, there are
74,173 living ten years later.

A COMPARISON OF ACTUAL AND THEORETICAL FREQUENCIES
IN THE REALM OF PURE CHANCE

Certain points of importance may be made clear by
comparing some experimental results with the theoretical
frequencies given by the binomial expansion. Twelve dice
were thrown a number of times. Each 4, 5, or 6 spot
appearing was considered to be a success, while a 1, 2,
or 3 spot was a failure. (In a typical throw we might have
the following spots up: 3, 1, 5, 1, 2, 4, 4, 6, 3, 2, 3, 5. In
this lot there are five successes, and the result is so tallied.)
In a classical example recorded by W. F. R. Weldon¹
twelve dice were thrown in this way 4,096 times, a success
being defined as above. The results are recorded in column
(2) of Table 107, and the distribution is shown in Fig. 84.
By computation we find the arithmetic mean and the
standard deviation of this distribution to be, respectively,
6.139 and 1.712.

¹ Cited by F. Y. Edgeworth, Encycl. Brit., 11th ed., Vol. XXII, 394.

Let us compare with these results those which we might
expect from the given conditions. Twelve dice were thrown
each time, hence we are dealing with 12 independent
events. There were 4,096 trials. Since either a 4, 5, or 6 is
considered a success, $p = q = \frac{1}{2}$.

For the terms in the binomial expansion we have

$$(p + q)^n = p^n + np^{n-1}q + \frac{n(n-1)}{1 \cdot 2}\,p^{n-2}q^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}\,p^{n-3}q^3 + \cdots$$

In the present case we have

$$4{,}096\left(\frac{1}{2} + \frac{1}{2}\right)^{12}.$$

Expanding,

$$4{,}096\left(\frac{1}{4{,}096} + \frac{12}{4{,}096} + \frac{66}{4{,}096} + \frac{220}{4{,}096} + \frac{495}{4{,}096} + \frac{792}{4{,}096} + \frac{924}{4{,}096} + \frac{792}{4{,}096} + \frac{495}{4{,}096} + \frac{220}{4{,}096} + \frac{66}{4{,}096} + \frac{12}{4{,}096} + \frac{1}{4{,}096}\right).$$

Completing the indicated multiplication we have the theoretical
frequencies of the various possible successes in
4,096 throws of twelve dice. These are shown in column
(3) of Table 107.

Table 107

Comparison of Actual and Theoretical Frequencies in Dice-Rolling
Experiment

     (1)             (2)              (3)
  Number of        Observed       Theoretical
  successes       frequencies     frequencies

      0                0                1
      1                7               12
      2               60               66
      3              198              220
      4              430              495
      5              731              792
      6              948              924
      7              847              792
      8              536              495
      9              257              220
     10               71               66
     11               11               12
     12                0                1

   Total           4,096            4,096


The distribution of the theoretical frequencies is shown
in Fig. 84, with that of the observed frequencies. The
relationship of the two distributions is close.

When we have, as in this case, a knowledge of the
probabilities involved, it is possible to determine the arithmetic
mean and the standard deviation of the distribution
of the theoretical frequencies. As a general expression for
the mean number of successes, where the number of independent
events and the probability of success are known,
we have

$$M = np.$$

Applying the present values,

$$M = 12 \times \frac{1}{2} = 6.$$

The mean, as computed from the observed frequencies, is
6.139.

As a general expression for the standard deviation,¹ under
the same conditions, we have

$$\sigma = \sqrt{npq}.$$

In the present case

$$\sigma = \sqrt{12 \times \frac{1}{2} \times \frac{1}{2}} = \sqrt{3} = 1.732.$$

Fig. 84. A Comparison of Actual and Theoretical Frequencies in a
Dice-Rolling Experiment

The standard deviation, as computed from the actual
frequencies, is 1.712.
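
The comparison just made may be reproduced from the figures of Table 107. The sketch below is illustrative only: the variable and function names are ours, the observed frequencies are copied from column (2), and the theoretical frequencies are generated from the binomial coefficients rather than copied.

```python
from math import comb, sqrt

# Observed frequencies of 0, 1, ..., 12 successes, column (2) of Table 107.
observed = [0, 7, 60, 198, 430, 731, 948, 847, 536, 257, 71, 11, 0]
n_events, p = 12, 0.5
n_trials = sum(observed)

# Column (3): terms of N(p + q)^n; whole numbers here because p = q = 1/2.
theoretical = [n_trials * comb(n_events, k) // 2 ** n_events
               for k in range(n_events + 1)]


def mean_and_sd(freqs):
    """Mean and standard deviation of the number of successes."""
    total = sum(freqs)
    mean = sum(k * f for k, f in enumerate(freqs)) / total
    var = sum(f * (k - mean) ** 2 for k, f in enumerate(freqs)) / total
    return mean, sqrt(var)


obs_mean, obs_sd = mean_and_sd(observed)
print(theoretical)
print(round(obs_mean, 3), round(obs_sd, 3))                   # observed
print(n_events * p, round(sqrt(n_events * p * (1 - p)), 3))   # np, sqrt(npq)
```

Applying `mean_and_sd` to the theoretical frequencies returns 6 and the square root of 3, which is a useful check on the two general expressions.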

When proportions, or relative frequencies, are dealt with,
the standard deviation ($\sigma'$) may be derived from the relation

$$\sigma' = \sqrt{\frac{pq}{n}}.$$

¹ This formula for the standard deviation of a binomial distribution is of
central importance. The derivation of this formula, and that for the mean of
a binomial distribution, are given in Appendix B.

The Normal Curve of Error


We may return to a consideration of the curve in Fig. 84
which represents the theoretical frequencies in the dice-throwing
experiments. It is a perfectly symmetrical 12-sided
polygon, the number of sides (excluding the base) corresponding
to the number of independent events in the particular
problem considered. With six events we should have a
six-sided figure, with twenty events a twenty-sided figure,
and so on. It is obvious that, as n increases, the number
of sides to the polygon increasing correspondingly,
the graph representing the expansion of the binomial
$(p + q)^n$ approaches more and more closely a smooth curve.
With n infinitely large a perfectly smooth curve would be
secured. This is the normal curve of error which has been
plotted in Fig. 85.

The equation to this curve is written in several forms, of
which

$$y = y_0\, e^{-\frac{x^2}{2\sigma^2}}$$

is one. In this equation $y_0$, the maximum ordinate, is a
constant; e is a constant (the base of the Napierian logarithms)
having a value of 2.71828; $\sigma$ represents the standard
deviation; and x is a given value of the independent
variable expressed as a deviation from the mean. The maximum
ordinate may be derived from the relation

$$y_0 = \frac{N}{\sigma\sqrt{2\pi}},$$

hence the equation to the normal curve may be written

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}$$

where $\pi$ is the constant 3.14159.
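
How closely the smooth curve follows the 12-sided polygon may be judged by evaluating the equation at each number of successes in the dice example. This is a sketch under our own naming (`normal_ordinate`); it takes N = 4,096, the mean np = 6, and the standard deviation $\sqrt{npq}$, and sets the binomial term beside the ordinate of the normal curve.

```python
from math import comb, exp, pi, sqrt


def normal_ordinate(x, n_cases, sigma):
    """Height of the normal curve at deviation x from the mean."""
    y0 = n_cases / (sigma * sqrt(2 * pi))       # maximum ordinate
    return y0 * exp(-x ** 2 / (2 * sigma ** 2))


# Dice example: N = 4,096 trials, n = 12 events, p = q = 1/2.
N, n = 4096, 12
mean, sigma = n * 0.5, sqrt(n * 0.5 * 0.5)

for successes in range(n + 1):
    binomial = N * comb(n, successes) / 2 ** n
    normal = normal_ordinate(successes - mean, N, sigma)
    print(f"{successes:2d}  {binomial:6.0f}  {normal:8.1f}")
```

With only twelve events the two columns already agree fairly well near the mean; the agreement would be expected to improve as n is increased.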

This equation may be derived in several ways.¹ One
procedure which throws light on the physical conditions
giving rise to the emergence of a normal distribution starts
from three basic assumptions.

1. The causal forces affecting individual events are
numerous, and of approximately equal weight.

2. The causal forces affecting individual events are
independent of one another.

3. The operation of the causal forces is such that deviations
above the mean of the combined results are balanced
as to magnitude and number by deviations below the mean.

¹ Gauss' deduction of the error equation may be found in all standard works
on the theory of least squares. Cf. references at end of Appendix A.

A great part of the power which modern statistical
technique possesses is derived from the detailed knowledge
of the characteristics of the normal or Gaussian curve.
From prepared tables showing the fractional parts of the
total area under the curve lying between ordinates erected
at stated distances from the maximum ordinate, theoretical
frequencies may be determined much more readily
than by the laborious method based upon the binomial
expansion.

USE OF A TABLE OF AREAS UNDER THE NORMAL CURVE

The entire area under a frequency curve is taken to
represent the total number of frequencies. Given information
as to the proportion of the total area within a given segment,
it would be easy to compute the frequencies represented
by this segment, or to determine the probability that a
given observation from the population represented by the
curve would fall within the limits of this segment. Prepared
tables of the probability integral, of which Table 108 is an
example, serve just this purpose, with respect to the normal
curve. (A more detailed table than that here given is
needed for accurate computation. Appendix Table I will
serve most purposes.¹)

¹ Tables of areas under the normal curve, as calculated by Dr. W. F. Sheppard,
are available in many publications. Cf. Tables for Statisticians and
Biometricians, edited by Karl Pearson, Biometric Laboratory, University College,
London; Tables of Applied Mathematics, J. W. Glover, Ann Arbor, Michigan,
George Wahr; Manual of Problems and Tables in Statistics, F. C. Mills and
D. H. Davenport, New York, Henry Holt and Co.

Table 108

Area of the Normal Curve in Terms of Abscissa
(Giving fractional parts of the total area between $y_0$ and ordinates erected
at varying distances from $y_0$)

    x/σ        Area            x/σ        Area

    0.0       .00000           2.0       .47725
    0.1       .03983           2.1       .48214
    0.2       .07926           2.2       .48610
    0.3       .11791           2.3       .48928
    0.4       .15542           2.4       .49180
    0.5       .19146           2.5       .49379
                               2.5758    .49500
    0.6       .22575           2.6       .49534
    0.7       .25804           2.7       .49653
    0.8       .28814           2.8       .49744
    0.9       .31594           2.9       .49813
    1.0       .34134           3.0       .49865
    1.1       .36433           3.1       .49903
    1.2       .38493           3.2       .49931
    1.3       .40320           3.3       .49952
    1.4       .41924           3.4       .49966
    1.5       .43319           3.5       .49977
    1.6       .44520           3.6       .49984
    1.7       .45543           3.7       .49989
    1.8       .46407           3.8       .49993
    1.9       .47128           3.9       .49995
    1.96      .47500           4.0       .49997


Since the normal curve is symmetrical about the maximum
ordinate, the values given in Table 108 apply to
observations on both sides of the mean. In using such a
table, deviations from the mean are first expressed in units
of the standard deviation. (The term normal deviate is
applied to such a quantity, that is, to a deviation from the
mean of a normal distribution expressed in units of the
standard deviation of that distribution.) The proportion
of the total area lying between any two ordinates may then
be readily determined. For example: What proportion of
the cases in a normal distribution lies between the maximum
ordinate and an ordinate erected at a distance from the
mean equal to $+\sigma$? Reading down the x/σ column to 1.0,
we find the value .34134 opposite it. This, in ratio form,
is the proportion of cases falling within the limits indicated.



Expressing this ratio as a percentage, we have 34.134 per
cent as the answer to our question.

Fig. 85. An Illustration of the Measurement of Areas under the Normal
Curve

Fig. 85 shows the relation of this area (the shaded area A)
to the total area under the curve.

What proportion of the total number of cases in a normal
frequency distribution will fall between an ordinate erected
at a distance from the mean equal to $-1.4\sigma$ and one
erected at $-2\sigma$? From the table we find that 41.924 per
cent of the total area will lie between $y_0$ and the ordinate
at $-1.4\sigma$; 47.725 per cent will lie between $y_0$ and the
ordinate at $-2\sigma$. The difference, 5.801 per cent, will fall
between the ordinates at $-1.4\sigma$ and at $-2\sigma$. This may
be converted into actual frequencies by taking this proportion
of the total number of cases in the given distribution.
The shaded segment B in Fig. 85 represents the area thus
marked off.

For certain purposes we wish to know the proportion
of the total number of cases deviating by a stated amount
or more in either direction from the mean of a normal
distribution. If we wish to know the proportion of all
cases deviating from the mean by $1.96\sigma$ or more, we must
add to the area between $+1.96\sigma$ and the upper limit of
the curve the area between $-1.96\sigma$ and the lower limit of
the curve. Each of these areas equals .50000 - .47500,
or .025. The percentage of cases deviating from the mean
by $+1.96\sigma$ or more is 2.5; the percentage deviating by
$-1.96\sigma$ or more is 2.5. The percentage deviating above
or below the mean by $1.96\sigma$ or more is 5.0. Similarly,
it may be determined from the entries in Table 108 that
just one per cent of all the cases in a normal distribution
will deviate from the mean, positively or negatively, by
$2.5758\sigma$, or more. This “one per cent” area is represented
by the sum of the shaded portions at the two tails of Fig. 85.
The ordinates defining the inside limits of these segments are
erected at $+2.5758\sigma$ and at $-2.5758\sigma$, while the outer
limits are at infinity.

Special significance attaches to the two limits last mentioned,
because of the uses made of them in interpreting
errors of sampling. This topic is developed at a later point.
Here we may note that the figures defining proportions of
the total area under the normal curve falling in given
areas may also be interpreted as probabilities. The probability
that a given observation, made at random in a population
distributed according to the normal law of error,
will fall between the mean and a value one standard deviation
above the mean is .34134; the probability that a given
observation will deviate from the mean by $1.96\sigma$ or more
is .05; the probability that a given observation will deviate
from the mean by $2.5758\sigma$ or more is .01.
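
The entries of a table of areas, and the probabilities just quoted, may be checked with the error function of the Python standard library. This is a sketch, not the method by which the published tables were prepared; the function names `area_from_mean` and `two_tailed` are ours, and the relation used is that the area between the mean and a normal deviate z equals one half of erf(z / √2).

```python
from math import erf, sqrt


def area_from_mean(z):
    """Fraction of a normal distribution between the mean and z sigmas."""
    return 0.5 * erf(abs(z) / sqrt(2))


def two_tailed(z):
    """Probability of deviating from the mean by z sigmas or more."""
    return 2 * (0.5 - area_from_mean(z))


# Area A: between the mean and +1 sigma.
print(f"{area_from_mean(1.0):.5f}")
# Area B: between -1.4 sigma and -2 sigma.
print(f"{area_from_mean(2.0) - area_from_mean(1.4):.5f}")
# Both tails beyond 1.96 sigma and beyond 2.5758 sigma.
print(f"{two_tailed(1.96):.4f}  {two_tailed(2.5758):.4f}")
```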





The method by which probabilities of occurrence may 
be determined from a table of areas under the normal 
curve, and by which the significance of a given normal 
deviate may be established, should be clearly understood. 
These methods enter in many ways into the work of a 
statistician. 

The uses of the normal curve of error, and of the table 
of areas based upon the integration of this curve, are too 
varied to be enumerated at length here. A simple example 
may serve to introduce the subject. 

An Economic Application 

The statistical division of the American Telephone and
Telegraph Company has made a study of the annual
message use of four-party line residence message rate subscribers
in Buffalo. The annual messages for each of 995
subscribers were tabulated and classified.¹ The results,
together with certain computations, appear in Table 109.


THE MOMENTS OF A FREQUENCY DISTRIBUTION


Some terms and symbols that have not been employed
heretofore may be introduced at this point. We may
write, using $\nu$ (nu) to define certain quantities of interest
to us,

$\nu_1 = \frac{\Sigma(fx')}{N}$ = first moment of the distribution about the arbitrary
origin,

$\nu_2 = \frac{\Sigma(fx'^2)}{N}$ = second moment of the distribution about the arbitrary
origin,

$\nu_3 = \frac{\Sigma(fx'^3)}{N}$ = third moment of the distribution about the arbitrary
origin,

$\nu_4 = \frac{\Sigma(fx'^4)}{N}$ = fourth moment of the distribution about the arbitrary
origin.


¹ “Introduction to Frequency Curves and Averages.” Statistical Bulletin,
Statistical Methods Series, No. 1. Issued by Chief Statistician, American
Telephone and Telegraph Co.

Table 109

Annual Message Use of 995 Telephone Subscribers
(Illustrating the computation of the moments of a frequency distribution)

    (1)          (2)      (3)       (4)          (5)       (6)        (7)        (8)
Interval of     Mid-     Fre-    Deviation
message use*    point    quency  from arbitrary
                                 origin in class-
                                 interval units
                  m        f        x'          fx'      fx'²       fx'³       fx'⁴

    0-   50        25       0      -10            0         0          0          0
   50-  100        75       1       -9           -9        81       -729      6,561
  100-  150       125       9       -8          -72       576     -4,608     36,864
  150-  200       175      19       -7         -133       931     -6,517     45,619
  200-  250       225      38       -6         -228     1,368     -8,208     49,248
  250-  300       275      50       -5         -250     1,250     -6,250     31,250
  300-  350       325      95       -4         -380     1,520     -6,080     24,320
  350-  400       375      85       -3         -255       765     -2,295      6,885
  400-  450       425     115       -2         -230       460       -920      1,840
  450-  500       475     132       -1         -132       132       -132        132
  500-  550       525     144        0            0         0          0          0
  550-  600       575     116        1          116       116        116        116
  600-  650       625      79        2          158       316        632      1,264
  650-  700       675      54        3          162       486      1,458      4,374
  700-  750       725      31        4          124       496      1,984      7,936
  750-  800       775      11        5           55       275      1,375      6,875
  800-  850       825       5        6           30       180      1,080      6,480
  850-  900       875       6        7           42       294      2,058     14,406
  900-  950       925       2        8           16       128      1,024      8,192
  950-1,000       975       1        9            9        81        729      6,561
1,000-1,050     1,025       1       10           10       100      1,000     10,000
1,050-1,100     1,075       1       11           11       121      1,331     14,641

Total                     995                  -956     9,676    -22,952    283,564

* As here classified an item having a value of 50 was put in the class having
50 as an upper limit. Items falling on other class limits were similarly disposed
of.

“Moment” is a familiar mechanical term for the measure
of a force with respect to its tendency to produce rotation.
The strength of this tendency depends, obviously, upon the
amount of the force and the distance of the point at which
the force is exerted from the origin. The term is used in statistics
in a quite analogous sense, the class-frequencies being
looked upon as the forces in question. The size of each class-frequency
and the distance of its midpoint from the
origin are the factors of prime importance in this respect.
The moments of a distribution about any origin may be
computed by multiplying the frequency of each class by
a given power of its distance, along the x-axis, from the
origin, summing the resulting products and dividing by
the number of cases. If the first moment is desired, the
first power of the x-distance is employed; if the fourth
moment, the fourth power of the x-distance, etc. The
subscripts indicate the moments represented by the various
symbols.

The most significant moments, for statistical purposes,
are those which relate to the arithmetic mean as origin.
Representing these moments by $\pi$ (pi)¹ we have the
following relationships:

First moment about the mean  = $\pi_1 = 0$.

Second moment about the mean = $\pi_2 = \nu_2 - \nu_1^2$.

Third moment about the mean  = $\pi_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3$.

Fourth moment about the mean = $\pi_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4$.
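
These relationships may be tried on the figures of Table 109. In the sketch below the frequencies are copied from column (3), the deviations run from -10 to 11 class-interval units as in column (4), and all names (`freqs`, `devs`, `moment_about_origin`, `moment_about_mean`) are ours. The moments about the mean are found twice, once from the relationships above and once directly, as a check on the formulas.

```python
# Frequencies of Table 109 for x' = -10, -9, ..., 11 (class-interval units).
freqs = [0, 1, 9, 19, 38, 50, 95, 85, 115, 132, 144,
         116, 79, 54, 31, 11, 5, 6, 2, 1, 1, 1]
devs = range(-10, 12)
N = sum(freqs)


def moment_about_origin(k):
    """nu_k: the k-th moment about the arbitrary origin."""
    return sum(f * x ** k for f, x in zip(freqs, devs)) / N


nu1, nu2, nu3, nu4 = (moment_about_origin(k) for k in (1, 2, 3, 4))

# Moments about the mean, from the relationships given in the text.
pi_2 = nu2 - nu1 ** 2
pi_3 = nu3 - 3 * nu1 * nu2 + 2 * nu1 ** 3
pi_4 = nu4 - 4 * nu1 * nu3 + 6 * nu1 ** 2 * nu2 - 3 * nu1 ** 4


def moment_about_mean(k):
    """Direct computation: deviations measured from the mean itself."""
    return sum(f * (x - nu1) ** k for f, x in zip(freqs, devs)) / N


for k, value in ((2, pi_2), (3, pi_3), (4, pi_4)):
    print(k, round(value, 4), round(moment_about_mean(k), 4))
```

All results are in class-interval units (here 50 messages); no correction for grouping is applied at this stage.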

The computation of these moments from the data, as
classified, involves the assumption that the items in a
class can be treated as though they were concentrated at
the midpoint of that class. It has been established that,
under certain conditions, calculations made on this assumption
are subject to a constant error. In particular, it has
been shown that the values of the second and fourth
moments are not the same, when computed from grouped
data, as when computed from ungrouped data.

W. F. Sheppard has worked out certain corrections for
this bias.² His corrections may be applied when two
conditions prevail:

(1) When the distribution relates to a continuous variable.

(2) When the frequency curve is characterized by “high
contact,” i.e., when the frequency curve tapers off gradually
in both directions.

¹ In the equation to the normal curve π represents the familiar constant,
3.14159. As a symbol for a moment about the mean it relates, of course, to
no such constant value.

² Cf. Proceedings of the London Mathematical Society, Vol. XXIX, 353-380.

The symbol μ (mu) is employed to represent a corrected
moment about the mean. The application of Sheppard’s
corrections gives us the following final formulation:

$\mu_1 = 0$
$\mu_2 = \pi_2 - \frac{1}{12}$
$\mu_3 = \pi_3$
$\mu_4 = \pi_4 - \frac{1}{2}\pi_2 + \frac{7}{240}$

(In applying the corrections 1/12 and 7/240, the corresponding
decimal values, .083333 and .029167, will generally
be employed.) It is assumed in making these corrections
that a class-interval unit has been employed in measuring
deviations from the mean.

It may be noted in passing that the standard deviation is
the square root of the second moment about the mean. For
the uncorrected value,

$\sigma = \sqrt{\pi_2}$.

If Sheppard’s corrections¹ are to be applied

$\sigma = \sqrt{\mu_2}$.

The calculation of the moments of the frequency distribution
of telephone subscribers is shown on page 444. Sheppard’s
corrections are applied, since the curve is marked by
reasonably high contact. It is a discontinuous distribution,
but the unit (1) is so small in comparison with the range
that it may be treated as continuous.

¹ It should be noted that these corrections, when appropriate, are applicable
to the standard deviations entering into the calculation of the coefficient of
correlation.

$\nu_1 = \frac{-956}{995} = -.960804$

$\nu_2 = \frac{9,676}{995} = 9.724623$

$\nu_3 = \frac{-22,952}{995} = -23.067337$

$\nu_4 = \frac{283,564}{995} = 284.988945$

$\pi_1 = 0$

$\pi_2 = \nu_2 - \nu_1^2 = 9.724623 - .923144 = 8.801479$

$\pi_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3 = -23.067337 + 28.030370 - 1.773922 = 3.189111$

$\pi_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4$
$\quad = 284.988945 - 88.652760 + 53.863384 - 2.556586 = 247.642983$

$\mu_1 = 0$

$\mu_2 = \pi_2 - \frac{1}{12} = 8.801479 - .083333 = 8.718146$

$\mu_3 = \pi_3 = 3.189111$

$\mu_4 = \pi_4 - \frac{1}{2}\pi_2 + \frac{7}{240} = 247.642983 - 4.400739 + .029167 = 243.271411$
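
The arithmetic above may be checked mechanically. The following Python sketch starts from the column totals of the moment table (995 cases; sums of -956, 9,676, -22,952 and 283,564 for the first to fourth powers of the deviations) and applies the relationships for moments about the mean, followed by Sheppard's corrections. The variable names are illustrative only.

```python
n = 995                                # number of cases
sums = [-956, 9676, -22952, 283564]    # totals of f*d, f*d^2, f*d^3, f*d^4

# Moments about the arbitrary origin, in class-interval units.
v1, v2, v3, v4 = (s / n for s in sums)

# Moments about the mean.
pi2 = v2 - v1 ** 2
pi3 = v3 - 3 * v1 * v2 + 2 * v1 ** 3
pi4 = v4 - 4 * v1 * v3 + 6 * v1 ** 2 * v2 - 3 * v1 ** 4

# Sheppard's corrections (valid only in class-interval units).
mu2 = pi2 - 1 / 12
mu3 = pi3
mu4 = pi4 - pi2 / 2 + 7 / 240

print(round(pi2, 6), round(pi3, 6), round(pi4, 6))
print(round(mu2, 6), round(mu3, 6), round(mu4, 6))
```

Small differences in the last decimal place from the figures in the text are to be expected, since the text rounds each intermediate term to six places.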


CRITERIA OF CURVE TYPE

Having these values we may return to a consideration
of the main problem, the utilization of our knowledge of
the normal curve. There are certain criteria, represented
by the letters β (beta) and κ (kappa), which enable us
to determine readily whether a given distribution may be
described by a curve of the normal type. These may be
derived from the corrected moments of the given distribution.


$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{10.170429}{662.632015} = .01534853$

$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{243.271411}{76.006070} = 3.200683$

$\kappa_2 = \frac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)} = \frac{.01534853 \times 38.448470}{4(12.756686)(.355320)} = \frac{.5901275}{18.130823}$

$\kappa_2 = .032548$

For the normal curve these criteria have the following
values:

$\beta_1 = 0$
$\beta_2 = 3$
$\kappa_2 = 0$

We may conclude, tentatively, that the normal curve
may be used to describe the given distribution.¹
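
As a check on the criteria, the short Python sketch below recomputes them from the corrected moments quoted in the text. The variable names are illustrative; the formulas are those given above.

```python
# Corrected moments about the mean, as quoted in the text.
mu2, mu3, mu4 = 8.718146, 3.189111, 243.271411

beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2
kappa2 = (beta1 * (beta2 + 3) ** 2) / (
    4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
)

# For a normal curve the three values would be 0, 3 and 0.
print(round(beta1, 8), round(beta2, 6), round(kappa2, 6))
```

The factor (2β2 - 3β1 - 6) in the denominator is small for nearly normal distributions, so κ2 is sensitive to rounding in β2; several decimal places should be carried.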

Fitting a Normal Curve; Use of a Table of Areas

The process of fitting a normal curve to a set of observations
involves the computation of theoretical frequencies
corresponding to the observed frequencies. This may be
done from a table of areas under the normal curve (see
Appendix Table I). Using such a table, in the manner indicated
in the preceding section, the areas between the maximum
ordinate and ordinates erected at the various class
limits may be determined. By the simple process of subtraction
the area within each class, and hence the theoretical
frequencies, may then be computed. The procedure is illustrated
in Table 110 on page 446, relating to the distribution
of telephone subscribers.
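
The same procedure may be sketched in Python, with the error function of the standard library standing in for the printed table of areas. This is an illustration under stated assumptions rather than the method of the text: the mean (476.96), the standard deviation (147.65) and the number of cases (995) are the figures quoted in this chapter for the telephone data, while the function names are hypothetical. Because the printed table is entered with deviations rounded to two decimal places, the results will differ slightly from those of Table 110.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def theoretical_frequencies(limits, mean, sigma, n):
    """Expected number of cases between each pair of adjacent class limits."""
    cumulative = [normal_cdf((x - mean) / sigma) for x in limits]
    return [n * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]


limits = list(range(0, 1101, 50))   # class limits 0, 50, ..., 1,100
freqs = theoretical_frequencies(limits, mean=476.96, sigma=147.65, n=995)

for lo, hi, f in zip(limits, limits[1:], freqs):
    print(f"{lo:>5}-{hi:<5} {f:8.2f}")
```

Subtracting cumulative areas is equivalent to the two-step process of the text, in which areas are first measured from the maximum ordinate and then differenced.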

¹ Account is later taken of the bearing of errors of sampling on this
conclusion. See Chap. XVIII.

Table 110

Illustrating the Computation of Theoretical Frequencies from a Table of Areas

(1)      (2)          (3)                 (4)                (5)
Class    Deviation    Proportion of       Number of cases    Theoretical frequencies,
limit    from mean    area between y0     between y0 and     by classes
         x/σ          and ordinate        ordinate at x/σ
                      at x/σ

  0      -3.23        .4993810            496.88
 50      -2.89        .4980738            495.58               0- 50       1.92*
100      -2.55        .4946139            492.14              50-100       3.44
150      -2.22        .4867906            484.36             100-150       7.78
200      -1.88        .4699460            467.60             150-200      16.76
250      -1.54        .4382198            436.03             200-250      31.57
300      -1.20        .3849303            383.01             250-300      53.02
350      - .86        .3051055            303.58             300-350      79.43
400      - .52        .1984682            197.48             350-400     106.10
450      - .18        .0714237             71.07             400-450     126.41
500      + .16        .0635595             63.24             450-500     134.31
550      + .495       .1896931            188.74             500-550     125.50
600      + .83        .2967306            295.25             550-600     106.51
650      +1.17        .3789995            377.10             600-650      81.85
700      +1.51        .4344783            432.31             650-700      55.21
750

* The theoretical distribution shows .62 of a case below -3.23σ. To preserve
formal consistency this amount has here been added to the theoretical
frequency between 0 and 50.

The theoretical distributions derived from this fitting
process may be compared with the observed frequencies,
as given in Table 109. Or the comparison of the actual
distribution and the fitted curve may be made graphically,
as in Fig. 86. It is apparent by inspection that the normal
curve gives a fairly good fit to the data, although there
are several classes in which the differences are marked. A
natural question arises as to the reason for the failure of
the normal curve to fit at all points. There are two possible
answers to such a question. The failure to fit may be due
merely to chance fluctuations such as are found in any
sample. We may have an underlying law of distribution
of residence subscribers, classified by message use, which
accords perfectly with the normal law of error, but the
particular sample selected may be marked by certain
irregularities which would be ironed out if a very large
number of cases were included. On the other hand, the
differences may be due to the fundamental failure of such a
distribution to accord with the normal law of error. Such
a law may not describe the distribution of telephone calls,
in which case the normal curve should not be employed.

[Fig. 86. Illustrating the Fitting of a Normal Curve to Frequency
Distribution of Telephone Subscribers, Classified according to Message
Use. Horizontal axis: Number of Messages.]

At this stage we may note, without discussion, that the
differences between theoretical and observed frequencies in
the present example are small enough to be attributed to
chance fluctuations of sampling. The reasoning that supports
this conclusion is presented in a later section (Chapter
XVIII). The evidence is clear, however, that the discrepancies
between the observed frequencies and those in the
corresponding normal distribution are not excessively large.
The observed facts are not inconsistent with the hypothesis
that residential telephone subscribers, classified according
to frequency of telephone use, are distributed in accordance
with the normal law of error.

This conclusion gives generality to the results of our 
study. We have a great deal of information concerning 
the attributes of distributions following the normal law 
of error, and once the identification of an actual distribu- 
tion with this standard type has been effected we may 
draw upon this store of knowledge. In using the original 
frequency table we are limited to the classes there estab- 
lished. We may now go beyond this and determine how 
many cases may be expected within stated limits. We may 
compute the probability of a case falling between any two 
points on the x-scale, or above or below any given value.
The observed results, standing alone, are restricted in their 
significance to the particular observations recorded, but 
the theoretical frequencies have no such limitations. They 
apply generally, to the entire population from which the 
sample was drawn. In so far as we are assured of the repre- 
sentative character of our sample we have a basis for 
inference that would be afforded by no amount of study 
of the particular distribution as a thing apart. This fact, 
that a knowledge of the theoretical frequencies permits 
generalization beyond the limits of direct observation, is 
perhaps the most important of the advantages derived 
from the identification of an actual distribution with an 
ideal type, such as the normal distribution.*

* As was stated, the normal curve is but one type of frequency curve, though
one of basic importance. A comprehensive system of frequency curves is that
associated with the name of Karl Pearson, who has derived equations to and
has described in detail a number of standard types. An account of other
fundamental types will be found in the books by Arne Fisher referred to at
the end of this chapter.

NOTE ON THE DESCRIPTION OF THE FREQUENCY DISTRIBUTION

With the aid of the criteria explained in this chapter it is possible
to describe a frequency distribution more accurately than is possible
with the measurements employed in the earlier chapters. A treatment
of this subject is beyond the scope of the present book, but it
seems advisable to indicate briefly the nature of these additional
measures.

The value of $\beta_2$ serves as a measure of the degree of “flat-toppedness”
found in a given curve. If $\beta_2 = 3$, as in the normal type, the
curve is said to be mesokurtic. If $\beta_2 < 3$ the curve is platykurtic, or
flatter than the normal type. If $\beta_2 > 3$, as in the example given
above, the curve is leptokurtic, or more peaked than the normal.

A measure of skewness which is more accurate than those given
early in the book may also be computed from these criteria.
Karl Pearson has shown that the quantity

$\chi = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$

serves as a measure of the degree of asymmetry of a given curve.
Inserting the values of $\beta_1$ and $\beta_2$ given above we have, in the case
of the distribution based on message use,

$\chi = -.05558$.

(χ is positive if the mean is greater than the median, negative if
the mean is less than the median. In the present case the value of
the mean is 476.96, that of the median is 482.39, hence the
skewness is negative.)

Finally, the distance, d, between the mean and the mode may
be determined from the relation

$d = \chi \times \sigma$.

In the distribution described above (relating to telephone use) σ,
in original units, equals 147.65. Hence

$d = -.05558 \times 147.65 = -8.21$.

Since

$Mo = M - d$

we have

$Mo = 476.96 + 8.21 = 485.17$.

This gives a truer approximation to the modal value than any of
the methods discussed in Chapter IV.
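
The steps of this note can be retraced in a brief Python sketch. The figures are those quoted in the text; the variable names are illustrative, and the sign of the skewness is assigned by the rule stated in the note (positive when the mean exceeds the median), which is an assumption about how the rule is to be applied mechanically.

```python
import math

beta1, beta2 = 0.01534853, 3.200683          # criteria quoted in the text
mean, median, sigma = 476.96, 482.39, 147.65  # in original units

# Magnitude of the skewness from the two criteria.
magnitude = math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))

# Sign by the rule of the note: positive if mean > median, else negative.
chi = magnitude if mean > median else -magnitude

d = chi * sigma      # distance between mean and mode
mode = mean - d      # Mo = M - d

print(round(chi, 5), round(d, 2), round(mode, 2))
```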

The methods exemplified in Table 109 and the accompanying
text provide, therefore, a straightforward procedure for the measurement
of the essential attributes of a frequency distribution.
The mean and mode as measurements of central tendency, the
standard deviation as a measure of dispersion, χ as a measure of
skewness, and $\beta_2 - 3$ as a measure of the degree of concentration
of observations near the point of maximum frequency, may be
computed directly from the first four moments of a distribution.
These methods are available, of course, whether or not a study is
to be carried to the point of determining and fitting a frequency
curve of an appropriate ideal type. They are to be recommended
for use in any systematic study of frequency distributions.

CHAPTER XIV


STATISTICAL INDUCTION AND THE PROBLEM 
OF SAMPLING 

The preceding pages have been devoted to an account of 
tools employed in statistical analysis. Examples illustrating 
the application of these tools to specific problems have 
been presented, but the emphasis throughout has been on 
technique. It is appropriate at this point that we stand off 
a distance, enlarging our perspective, and consider certain 
general problems relating to the application of these tools. 
What is their proper place in economic and business research? 
What are the assumptions involved in using them and 
what are their limitations? What are the end products 
of statistical analysis? How valid are the conclusions 
reached? What restrictions attach to such conclusions? 
We must give thought to such questions as these, if statistical 
methods are to be intelligently applied. 

Statistical Description and Statistical Induction 

In approaching this subject we must first make clear
the distinction between statistical description and statistical
induction. By employing the methods of statistics it is
possible, as we have seen, to describe succinctly a mass
of quantitative data. Hundreds or thousands of individual 
cases may be classified, and a frequency distribution formed. 
The essence of this distribution may be boiled down to 
perhaps four measures — of central tendency, variation, 
skewness, and kurtosis. A tremendous gain has been realized 
in thus replacing the multiplicity of individual cases by a 
limited number of measures that define the characteristics 
of the group as a whole. The possession of such tools makes
it possible for our limited powers of perception to grasp
the significance of facts in the mass. Again, the methods
of statistics enable us to describe relations between variable
quantities. By securing the equation to an appropriate
curve fitted to the data by mathematical methods, we
may determine how much, on the average, one quantity 
changes in value as a related factor varies. This may be 
supplemented by a measure of the scatter or dispersion 
about the fitted curve, and by a measure, in abstract 
terms, of the degree of correlation between the dependent 
and the independent variables. 

In so far as the results are confined to the cases actually
studied, these various statistical measurements are merely
devices for describing certain features of a distribution, or
certain relationships. Within these limits the measures
may be used with perfect confidence, as accurate descriptions
of the given characteristics. But when we seek to
extend these results, to generalize the conclusions, to apply
them to cases not included in the original study, a quite
new set of problems is faced.

The logical process by which one arrives at generalizations 
from a study of particular cases is termed induction, as 
opposed to deduction, which involves the drawing of specialized
conclusions from general propositions. By statistical
induction, or statistical inference, is meant the generalization
of statistical results, the application to a population of
measurements derived from a sample. We are employing
this procedure constantly in practical statistical work,
though not always with a full realization of the assumptions 
inherent in that process and of the limitations attaching 
to it. 


The Nature of Statistical Induction

The problem at issue in considering the validity of
statistical induction may be put in the following form:
A statistical measurement (an average, a frequency ratio,
a coefficient of correlation) has been derived from the
study of sample data drawn from a given population.
(The term “population” refers to a complete universe of
things or phenomena having stated characteristics in common.)
May we assume that, if additional samples were
taken from the same population, the corresponding measure- 
ments would have the same values? If not, may we deter- 
mine the approximate limits to the fluctuations to be 
expected in these measures, as derived from successive 
samples? Here, obviously, is a problem of supreme impor- 
tance. Karl Pearson has called it “the fundamental problem 
of practical statistics.” If we cannot be assured of a certain 
degree of stability in the results secured from successive 
samples it would be quite invalid to generalize from the 
examination of a limited number of cases. No weight 
would attach to any study except one covering the entire 
universe of things or phenomena composing the given 
population. Yet such all-inclusive studies of economic 
phenomena are practically impossible. Index numbers of 
prices, of wages, of living costs, equations describing the 
relation between the production and prices of given com- 
modities, coefficients of correlation between temperature 
and crop yield — all must of necessity be based on the 
study of samples. The problem of statistical inference, 
in the words of Oskar Anderson, is that of so utilizing the 
samples as to arrive at the best possible approximation 
to the characteristics of the universe. 

We have noted that statistical inference is a special 
form of a general process of reasoning, induction. Two 
points are to be emphasized concerning inductive reasoning. 

First, the conclusion of any induction holds only in terms
of probabilities. For such a conclusion, by the very definition
of an induction, applies to cases not included in
the observations. As opposed to deductive reasoning, in
which the conclusion is implicit in the premises, induction
yields a conclusion going beyond the premises. When all
the cases to be covered by the conclusions are included in
the observations, the conclusion ceases to be an induction
and becomes a descriptive statement. Accordingly, although 
induction is a highly fruitful means of adding to human 
knowledge, it is always hazardous. A leap in the dark is 
always involved, when we apply conclusions to cases not 
yet observed. 

The justification for this leap in the dark, and this is
the second point we wish to stress, is found in an assumption
that there is a “limitation to the amount of independent
variety” found in nature. While there is variation in
nature, the degree of such variation is limited; there is
some uniformity in all natural processes. When we are
dealing with quantitative data this uniformity in nature is
found in the stability of large numbers, as exemplified by
the curious regularities in such phenomena as birth rates
or death rates. Nature, in other words, is not marked by
utter chaos; principles of regularity, order, and stability
appear in all natural processes, and these principles are
strongly evident when we deal with masses of quantitative
data. Therefore, when we generalize such a measure as
an index number of wholesale prices, we do so on some such
assumption as this: It is reasonable to suppose that, in
the larger population to which this result is to be applied,
there exists a uniformity with respect to the characteristic
or relation we have measured. As a result of this uniformity
we should expect statistical measurements derived from
successive samples drawn from this population to fluctuate
within definite limits.

It is evident that in making this assumption, in saying
“It is reasonable to suppose . . .” we are introducing
an hypothesis which is incapable of complete verification
by purely statistical methods. There is, thus, in
statistical induction, an a priori element. The statistical
conclusion can never stand completely on its own feet.
It must be endorsed by reason and judgment if it is to
carry conviction. If a high positive coefficient of correlation
were secured from the study of a sample relating to banana
importations and sales of new life insurance, this would
not furnish convincing evidence of a causal relation, or a
relation of contingency, between these two variables. There
is no reasonable basis for assuming that, in the larger
universe of phenomena from which the sample was drawn,
there would be uniformity with respect to this relationship.

Statistical inference differs from the general process of
induction in that a quantitative result is generalized. We
seek to apply to a larger group (the population) the
value of mean, standard deviation, or coefficient of correlation
that has been computed from a sample. The measurement
secured from the sample is an estimate of the corresponding
measurement relating to the population. The
direct task faced in such generalization is that of determining
the limits within which these estimates would probably 
fluctuate, if based upon a number of different samples 
drawn from the same population. A number defining these 
limits will serve as a measure of the reliability of the given 
results, when generalized to apply to the population. 

We should make clear at this point the sense in which
the term “population” is used. When we speak of a population
we are referring to an aggregate, whether of persons,
things, or measurements, having certain common characteristics,
or generated by a given system of causes. The
term may refer to a hypothetical population from which
a given sample may or may not have been drawn, or to a
parent population of which a given sample is assumed to
be representative. It may be a population of prices, or a
population of cephalic indices; the term is not restricted
to a population of persons. R. A. Fisher speaks of a “population
of possibilities,” referring to the possible results of
an experiment many times repeated. Of high importance
in statistics are populations of statistical measurements:
means, coefficients of variation, standard deviations, etc.
It is proper to note that the populations to which most
statistical results apply are infinite in size. Statistical
generalizations relate to hypothetical universes containing
infinite numbers of units. We assume a sample to be
drawn, not from the finite population that might be covered 
by actual enumeration, but from the infinite population, 
or universe, that would be generated if the forces or system 
of causes that brought this sample into being were to 
operate without limit. (Statisticians have given some atten- 
tion to special techniques, appropriate for dealing with a 
finite universe, but problems with which we do not here 
deal are faced in such applications.) 

The principle of the uniformity of nature is assumed, of
course, to apply to the universes from which our samples
are drawn, if these samples are to be made bases of inductive
generalizations. We must assume that these universes are
stable, and that all their attributes are stable. An attribute
of such a stable universe may not be exactly determined
from the attribute of a single sample, but measurements
defining the attributes of numerous samples drawn from
the same universe will be distributed about the true value
(i.e., that of the universe) in a systematic fashion. Each
sample value is, of course, an estimate of the true value
of the corresponding attribute of the population at large.¹
The precise determination of the characteristics of this
distribution of estimates is essential to the determination
of their reliability.

¹ By convention, not yet generally adopted, but useful, the attribute of
the population which is being estimated is termed a parameter, while the
estimate of it is termed a statistic. Our certain knowledge is limited to statistics.
We use this knowledge to the best of our ability to provide us with
approximations to the true parameters which we can never know.

Having knowledge of this distribution we may determine
the limits within which estimates derived from different
samples of the same population may be expected
to fluctuate. A measure of those limits will serve as a
measure of the reliability of the given results, when generalized.
Such a measure might be secured by the laborious
process of studying a great many different samples,
just as the dice were thrown 4,096 times in a preceding
example. Thus we might desire to test the reliability of an
average of weekly earnings of a certain class of workers.
A first average might be secured from a sample composed 
of 250 individual records. This result might be tested by 
computing 499 additional averages, each based on 250 
individual records. These 500 averages would not be 
identical in value, but if they were tabulated a frequency 
distribution closely approximating the normal type would 
be secured. From this distribution we might compute the 
mean of all the averages and the standard deviation of 
these averages. This standard deviation would serve as a 
measure of the variation found in the averages of weekly 
earnings, as computed from successive samples. 
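
The laborious experiment just described is easily imitated by machine. The Python sketch below draws 500 samples of 250 records each and tabulates the averages. Everything about the population is an assumption made for illustration: weekly earnings are taken to follow a markedly skewed (exponential) distribution with a mean of 30, and the seed is fixed only so that the run can be repeated.

```python
import random
import statistics

rng = random.Random(1938)          # fixed seed, for a repeatable run

POPULATION_MEAN = 30.0             # hypothetical average weekly earnings


def sample_average(size):
    """Average of one random sample of `size` hypothetical earnings records."""
    records = [rng.expovariate(1.0 / POPULATION_MEAN) for _ in range(size)]
    return statistics.fmean(records)


# A first average and 499 additional averages, each based on 250 records.
averages = [sample_average(250) for _ in range(500)]

mean_of_averages = statistics.fmean(averages)
sd_of_averages = statistics.pstdev(averages)

print(round(mean_of_averages, 2), round(sd_of_averages, 2))
```

Although the individual records here are far from normally distributed, a histogram of `averages` would be expected to approach the normal form, and their standard deviation is much smaller than that of the individual records.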

We have noted at an earlier point that a Gaussian or 
normal distribution is generated when three general condi- 
tions prevail. These are: 

a multiplicity of forces affecting each observation 

independence of the various forces affecting each observa- 
tion 

equality of the forces tending to generate values above 
and below the mean value. 

The process of random sampling which would, presumably, 
be employed in securing the successive samples referred to 
in the preceding paragraph should satisfy the conditions 
giving rise to the normal distribution. There should be no 
special or unbalanced influences affecting particular samples. 
The differences between successive samples should be such 
as arise from a combination of forces, intermingled and 
not open to separate definition; that is, from “the mass of 
floating causes known as chance.” If these conditions be 
met, and if the field of observation (i.e., the universe being
sampled) be homogeneous, the distribution of means computed
from the successive samples would be normal.

This is a fact of high importance to statistical inference.
In the realm of original observations, relating to persons, 
things, or events, normal distributions are the exception, 
rather than the rule. But the measurements which the 
statistician derives from successive samples, and which he 
employs in the inductive reasoning by which he generalizes 
his results, are far more frequently distributed in accordance 
with the Gaussian law. Much of the power of statistical 
instruments derives from this fact. 

The statistical investigator is rarely in a position to 
build up a frequency distribution of constants derived from 
numerous samples. It is generally impossible to take 400 
or 500 successive samples, in testing the reliability of a 
given measurement. As a practicable alternative a process
of mathematical deduction is employed, in determining the
characteristics of distributions of statistical measurements
derived from random samples, drawn under stated conditions
from given populations. An example of such mathematical
deduction is provided by the derivation of the mean and
standard deviation of a distribution generated under the
following conditions:

p, the probability of a given event occurring, is known

q, the probability of the event not occurring, is known

n, the number of independent events in a single trial, is known.

Under these conditions, as was noted in the preceding
chapter, $M = np$, and $\sigma = \sqrt{npq}$.¹ By a somewhat similar
chain of reasoning, we may determine the characteristics
of a distribution composed of arithmetic means of a number
of samples of constant size drawn from a given population.
The standard deviation of such a distribution is given by

$\sigma_M = \frac{\sigma}{\sqrt{N}}$

¹ For proofs, see Appendix B.

where $\sigma_M$ is the required measure, σ is the standard deviation
of the population from which the samples have been drawn,
and N is the number of observations in each of the samples.¹
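
Both deductive results can be verified numerically. In the Python sketch below the binomial distribution is built term by term, and its mean and standard deviation are compared with np and the square root of npq; a second function evaluates the standard deviation of sample means. The function names and the particular values of n, p, σ and N are hypothetical, chosen only for illustration.

```python
import math


def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial distribution by enumeration."""
    q = 1.0 - p
    probs = [math.comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]
    mean = sum(k * pr for k, pr in enumerate(probs))
    variance = sum((k - mean) ** 2 * pr for k, pr in enumerate(probs))
    return mean, math.sqrt(variance)


def standard_error_of_mean(sigma, n):
    """Standard deviation of means of samples of size n: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


n, p = 12, 1 / 3                       # hypothetical trial: 12 events, p = 1/3
mean, sd = binomial_mean_sd(n, p)
print(mean, n * p)                     # enumerated mean against np
print(sd, math.sqrt(n * p * (1 - p)))  # enumerated s.d. against sqrt(npq)

print(standard_error_of_mean(sigma=30.0, n=250))
```

The enumeration is feasible only for small n; the value of the deductive formulas lies in their making such enumeration unnecessary.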

The determination by deduction of the characteristics 
of distributions of statistical constants derived from samples 
is fundamental to the whole process of statistical inference.
It is not, of course, a task that needs to be done afresh in
each statistical investigation. When the law of distribution
of a given class of statistical measurements has been determined,
statisticians may utilize the results in their various
research fields, with due regard to all the conditions under
which the given law holds. This basic task has been performed
for most of the statistical measurements currently
employed. Earlier approximations have been refined in
recent years for many classes of statistical measurements.
The statistician today may draw upon a considerable body
of tested and verified materials in determining the reliability
of various kinds of statistical estimates. These
materials exist in the form of shorthand expressions for 
the standard errors of different statistical constants, and 
in prepared tables for use when the distributions deviate 
materially from the type defined by the normal law of error. 

¹ For proof see Appendix C.

Practical Problems of Sampling

The preceding discussion has dealt with one aspect of
statistical induction. The argument has proceeded on the
assumption that inferences concerning the attributes of
a population would be based upon a sample thoroughly
representative of the universe from which it was drawn.
The securing of such a sample is a first condition of valid
statistical induction. Practical problems of the first importance
are faced in the actual field work of sampling. The
procedures employed in such field work lie, in the main,
beyond the scope of the present book, but it is desirable
that the general nature of sampling techniques be indicated.
References given at the end of the chapter deal in greater
detail with these procedures.

The task of securing an adequate sample calls, on the 
negative side, for an avoidance of bias in the individual 
observations and of preventable errors in schedules and 
tabulations. The term bias is applied to observational
errors that are cumulative and non-compensating. Personal
prejudices on the part of reporters, mental attitudes of
which the subjects may be unconscious, or the mere physical
conditions of observation may lead to persistent errors
that distort samples. Errors in recording and tabulation
are easier to detect. Training of enumerators and careful
editing of schedules and tables will keep such errors to a
minimum.

On the positive side sampling technique is directed toward 
the securing of a sample that is truly representative of the 
universe of inquiry. This is a major task, calling for a 
high degree of care and judgment in planning field operations 
conforming to the ultimate objectives of the study. 
A. L. Bowley has classified, under the four heads distinguished 
below, methods suitable for use in securing a 
representative sample. 

The method of random selection is employed when the 
entire population to be sampled is treated as a whole, and 
members of the sample are so chosen as to be random 
members of that population. In this selection the individual 
choices must be independent of one another, and 
the chance of any member of the entire population being 
included in the sample must be the same as that of every 
other member. As regards the conditions of selection there 
should be present no element of preference or bias that 
would tend toward the inclusion or exclusion of certain 
members of the larger group. The general requirement 
here laid down should be interpreted, as J. M. Keynes 
has pointed out, to mean that with respect to the purpose 
of the particular investigation the members of the sample 
should be random members of the population at large. 
Intelligent planning is needed in securing a purely random 
sample. The obvious procedure of picking the most readily 
available cases would by no means meet the condition of 
random selection. Certain important elements in the uni- 
verse of facts to which the conclusions are to be applied 
may be excluded through the play of an unconscious bias 
unless careful attention is given to the selection of cases. 

The population from which a given sample is to be selected 
is often not homogeneous, with reference to the purpose of 
the particular investigation. Slum districts and wealthy districts 
may both have to be covered, in a study of social or eco- 
nomic conditions. Agricultural districts differing mate- 
rially in fertility may be included in a farm survey. If, 
by a process of stratification, the universe of inquiry 
may be broken into sub-groups individually more homo- 
geneous than the total population, the reliability of sampling 
results may be substantially increased. Within each sub-group 
random selection may be employed. This method 
is termed stratified selection. The size of each group 
in the sample should be proportionate to the relative 
importance in the total population of the stratum repre- 
sented by that group. Where homogeneous sub-groups are 
secured by the process of stratification, and where the 
differences between the sub-groups are pronounced, this 
method is distinctly superior to that of random selection 
among the undifferentiated members of the population at 
large. 
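
The allocation rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `proportional_stratified_sample`, the two strata, and their sizes are hypothetical and do not come from the text. Each stratum contributes a share of the sample proportionate to its share of the universe, and members are chosen at random within the stratum.

```python
import random

def proportional_stratified_sample(strata, sample_size, seed=0):
    """Draw a random sample within each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Share of the sample allotted to this stratum, rounded to a whole number.
        # With awkward proportions the rounded shares may not add exactly to sample_size.
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical universe: 600 "slum" and 400 "wealthy" districts, identified by number.
strata = {
    "slum": list(range(600)),
    "wealthy": list(range(600, 1000)),
}
sample = proportional_stratified_sample(strata, sample_size=50)
print({name: len(chosen) for name, chosen in sample.items()})  # {'slum': 30, 'wealthy': 20}
```

Fixing the seed makes the draw repeatable; in field work the essential point is only that selection within each stratum be random.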

In using the third method, that of purposive selection, the 
statistician seeks to secure a sample having the same 
characteristics as the universe of inquiry in respect of one 
or more “control” factors. If these controls are highly 
correlated with the quantities that are the objects of investigation, 
this method of selection gives obvious support to 
generalizations based on the study of the sample. As in 
stratified selection, sub-groups are employed. These sub-groups 
are chosen not at random, but in such a way as to 
possess, in the aggregate, the same attributes (e.g., means, 
standard deviations) as the population at large, in respect of 
the control factors. Deliberate manipulation, often through 
a process of trial and error, is necessary to effect this agree- 
ment between the sample and the totality. 

When this method is employed the statistician must, 
of course, have information concerning the “controls” for 
the total population. The application of the method is 
restricted to fields in which such knowledge is available. 
Census type inquiries on population, agriculture, and manu- 
factures provide such basic knowledge. Promising work 
has been done in purposive selection in dealing with agricul- 
tural data. 

The fourth method, that of stratified purposive selection, 
represents a combination of the use of stratification to 
secure homogeneous sub-groups and of deliberate selection 
through the use of controls. Where data are open to such 
stratification, and where necessary controls are available, 
the combined procedures may profitably be employed. 

When a representative sample has been secured, when 
errors and bias have been avoided, we may still expect the 
attributes of the sample to differ from those of the total 
population. The effects of fluctuations of sampling will 
still be present, so long as the coverage of the sample falls 
short of the universe of inquiry. We may only estimate the 
attributes of the population; we still face the uncertainties 
that inhere in induction. It is possible, however, to define 
with considerable precision the probabilities involved in 
statistical induction, when the differences between the 
attributes of the sample and those of the total population 
are due to fluctuations of “simple sampling,” that is, to 
the scrambled mass of causes that constitutes chance. 
Under these conditions it is possible to assign in advance 
limits within which we may expect statistical measures 
derived from different samples of the same population to 
fluctuate. This means that we may apply to the population 
at large statistical measures secured from the study of a 
sample, not with confidence in their perfect stability, but 
with fairly definite knowledge of the margin of error involved 
in thus extending our results. Where the necessary condi- 
tions are fulfilled statistical induction is a valid procedure. 

Use of Measures of Reliability 

Measurements defining the sampling errors to which given 
statistical constants are subject are put to various uses. 
It is in order now briefly to review the standard errors of 
different statistical measurements, and to illustrate their 
applications. 

SAMPLING ERRORS: THE MEAN 

For the standard error of an arithmetic mean we have 

$$\sigma_M^2 = \frac{\sigma^2}{N}$$

or 

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

where the symbol $\sigma$ in the numerator of the right-hand term 
refers to the standard deviation of the population from 
which the sample is drawn and N is the number of observa- 
tions in the sample. Actually, of course, we do not know 
the standard deviation of the population, but we use as 
an approximation to it the standard deviation of the 
sample. The approximation is acceptable except when the 
number of observations in the sample is small, in which 
case special treatment is needed.¹ 

Reference has been made above to the fact that a distribution 
of arithmetic means computed from random 
samples of a given population usually follows the normal 
law of error. This is true even though the distribution 
of the population from which the samples are drawn is 
not itself normal. Accordingly, we may interpret given 
values of $\sigma_M$ with reference to the probabilities associated 
with deviations in a normal distribution. 

¹ See Chapter XVIII. 

Table 34 in Chapter V shows the distribution in 1933 
of 11,404 workers in open hearth steel furnaces, classified 
according to their average hourly earnings. The arithmetic 
mean of this distribution is 50.14 cents; the standard 
deviation, which we may here represent by s, is 18.685 cents. 
Accepting this standard deviation as an approximation to 
the standard deviation of the population from which this 
sample was drawn,¹ we have 

$$\sigma_M = \frac{s}{\sqrt{N - 1}} = \frac{18.685}{\sqrt{11{,}404 - 1}} = .175.$$
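
The arithmetic may be checked with a short Python sketch. It assumes, as a labelled assumption, that the standard deviation 18.685 was computed with N in the denominator, so that N is reduced by 1 in the standard error; the function name and the flag `s_uses_n_minus_1` are illustrative only.

```python
import math

def standard_error_of_mean(s, n, s_uses_n_minus_1=False):
    """Standard error of the mean from the sample standard deviation s.

    N is reduced by 1 exactly once: either in computing s or here.
    """
    divisor = n if s_uses_n_minus_1 else n - 1
    return s / math.sqrt(divisor)

# Open hearth example: s = 18.685 cents, N = 11,404 workers.
se = standard_error_of_mean(18.685, 11_404)
print(round(se, 3))  # 0.175
```

With a sample of this size the choice between N and N - 1 does not affect the third decimal place.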


The true mean of the hourly earnings of wage workers 
in open hearth furnaces in 1933 is not known. The figure 
50.14 cents is our best approximation to it. If we should 
draw many samples, each the size of the one we have here, 
we should have many mean values normally distributed 
and centering, we may assume, at the true value. The 
standard deviation of this normal distribution we estimate 
as .175 cents. Knowledge of this standard deviation, or 
standard error, enables us to set limits within which it is 
highly likely that the true mean lies. Any statements we 
make about the true mean are to be interpreted with 
reference to this figure. 

¹ The formula for the standard error of the mean, when the $\sigma$ of the population 
is known, is given by $\sigma_M = \sigma/\sqrt{N}$. When the standard deviation of the 
population is replaced by that of the sample (s), as an approximation to the 
desired quantity, the formula for $\sigma_M$ may be written 

$$\sigma_M = \frac{s}{\sqrt{N}} \quad\text{or}\quad \sigma_M = \frac{s}{\sqrt{N - 1}}.$$

The first of these is appropriate if s has been derived from the relation 
$s = \sqrt{\Sigma d^2/(N - 1)}$ (where d is the deviation of a single observation from the 
mean); the second is appropriate if s has been derived from the relation 
$s = \sqrt{\Sigma d^2/N}$. In other words, N should be reduced by 1 either in the derivation 
of s or in the derivation of $\sigma_M$. If $\sigma_M$ is derived from the d's of the original data, 
the single operation is summed up in Bessel's formula 

$$\sigma_M = \sqrt{\frac{\Sigma d^2}{N(N - 1)}}.$$

(See Whittaker and Robinson, Calculus of Observations, London, Blackie & 
Son, 1924, 205-206.) The reason for the reduction of N is discussed in Chapters 
XV and XVIII, in dealing with “degrees of freedom.” 

We might, for example, on the basis of these results, 
make the flat statement: The true mean of the population 
lies between 49.965 cents and 50.315 cents. (The first of 
these limits is the sample mean minus one standard error; 
the second is the sample mean plus one standard error.) 
We may not assert that this statement is certainly true. 
It may be true or false. But if we continue indefinitely to 
draw samples from the population in question, computing 
the mean of each and the standard error of that mean, and 
if we make a statement about each similar to that made 
above, 68 out of 100 such statements will be true. (The 
actual numerical limits set by the different statements will 
differ, of course.) 

It is possible to vary the statement according to the 
degree of probability we wish to work with. Thus we might 
say: The true mean of the population lies between 49.80 
cents and 50.48 cents. ^ Of an indefinitely large number 
of such statements, each based on the study of a sample 
similar to the one before us, we know that 96 out of 100 
would be true. This is the kind of knowledge we have 
about generalizations based on results obtained from samples. 
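
The limits quoted in this and the preceding paragraph can be reproduced with Python's standard library. The sketch below is illustrative only; the helper name `limits` is hypothetical, and the mean (50.14) and standard error (.175) are the figures of the open hearth example.

```python
from statistics import NormalDist

mean, se = 50.14, 0.175  # figures from the open hearth example

def limits(mean, se, coverage):
    """Return (lower, upper) limits enclosing `coverage` of a normal distribution of means."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return mean - z * se, mean + z * se

lo95, hi95 = limits(mean, se, 0.95)
print(round(lo95, 2), round(hi95, 2))  # 49.8 50.48

# One standard error either side covers roughly 68 per cent of such statements.
one_se_coverage = NormalDist().cdf(1) - NormalDist().cdf(-1)
print(round(one_se_coverage, 3))  # 0.683
```

Widening the limits raises the proportion of true statements; narrowing them lowers it.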

The essential facts concerning the mean of the present 
sample and its reliability may be summarized in the statement: 
The mean hourly earnings of wage workers in open 
hearth furnaces in 1933 was (in cents) 50.14 ± .175.² 

¹ 49.80 = 50.14 - (1.96 × .175) 
50.48 = 50.14 + (1.96 × .175) 
Ninety-five per cent of the area under a normal curve is included within 
M ± 1.96σ. 

² The measure of sampling reliability here given is the standard error. The 
traditional usage, now less commonly followed, has been to give the probable 
error, which is .6745 times the standard error. In the present example the 
probable error of the mean is .118 cents. It is well, in any case, to specify 
the exact measure of reliability being used. 

The standard error of a mean is frequently used, not 
merely as an abstract measure of sampling reliability, but 
as an instrument for testing a given hypothesis. Such 
an hypothesis usually involves an assumed parent population, 
and the test centers about the question whether a 
given sample could have been drawn from this parent 
population. Let us assume that, on rational grounds, we 
have set up the hypothesis that the mean duration of 
business cycles is five years. We have observations relating 
to 77 cycles occurring in various countries during stages 
of rapid industrialization.¹ These cycles are distributed, in 
respect of duration, as follows: 

Duration of cycles,    Number of 
in years               cycles 
 1                      3 
 2                     10 
 3                     22 
 4                     15 
 5                     12 
 6                      8 
 7                      2 
 8                      2 
 9                      2 
10                      1 
Total                  77 

The mean duration of these 77 cycles is 4.09 years, and 
the standard deviation of the distribution is 1.88 years. 
For the standard error of the mean we have 

$$\sigma_M = \frac{1.88}{\sqrt{77 - 1}} = .216.$$

Are these results consistent with the hypothesis that our 
sample of 77 cycles is drawn from a parent population 
(i.e., a universe of cycles generated under similar conditions) 
with a mean duration of five years? 

¹ Cf. W. C. Mitchell, Business Cycles: The Problem and Its Setting, New 
York, National Bureau of Economic Research, 1927, 412-416; F. C. Mills, 
“An Hypothesis Concerning the Duration of Business Cycles,” Journal of 
the American Statistical Association, December, 1926, Vol. 21, 447-467. 

If we use M to represent the mean of the sample data, 
$M_h$ to represent the hypothetical mean of the universe, 
and T to denote the deviation of our sample mean from 
the hypothetical mean, expressed in units of the standard 
error of the mean, we may write 

$$T = \frac{M - M_h}{\sigma_M} = \frac{4.09 - 5.00}{.216} = \frac{-.91}{.216} = -4.21.$$

The figure .216 is, according to our hypothesis, the standard 
deviation of a distribution of arithmetic averages the mean 
value of which is 5.00. If we were drawing from such a 
distribution, the mean of our present sample would represent 
a departure of 4.21 standard deviations from the general 
mean. What is the probability of such a departure occurring 
merely as a result of chance? Consulting a table showing 
areas under the normal curve, we find that the area on 
one side of the mean, lying at a distance of 4.21 standard 
deviations or more from the mean, constitutes 1/100,000 
of the total area under the curve. In terms of probabilities, 
this means that there is only one chance in 100,000 that 
a member of the population represented by the normal 
curve will fall below the mean value by 4.21 standard 
deviations or more. This chance is so remote that we say 
the event in question could not occur. With reference to 
the present problem, we conclude that the results are not 
consistent with the hypothesis. We could not have secured 
the sample values in question had we been drawing from 
a universe of cycles with a mean duration of five years. 
The results fail to confirm the theory we have set up. 
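
The steps of this test may be retraced in a short Python sketch. The frequency distribution is that of the 77 cycles; the standard error .216 is taken from the text as given; the variable names are illustrative. The tail area is computed from the complementary error function rather than read from a printed table, so it should be expected to agree with the tabulated figure only in order of magnitude.

```python
import math

durations = range(1, 11)                      # duration in years
counts = [3, 10, 22, 15, 12, 8, 2, 2, 2, 1]   # number of cycles of each duration

n = sum(counts)
mean = sum(d * c for d, c in zip(durations, counts)) / n

se = 0.216            # standard error of the mean, as given in the text
hypothetical = 5.00   # hypothetical mean duration, in years
t = (mean - hypothetical) / se

# Area in one tail of the normal curve beyond |t| standard deviations.
one_tail = 0.5 * math.erfc(abs(t) / math.sqrt(2))

print(n, round(mean, 2), round(t, 2))  # 77 4.09 -4.21
print(f"{one_tail:.1e}")               # of the order of one chance in 100,000
```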

The probability cited (1/100,000) relates to a deviation 
on one side of the hypothetical value only. If we wish to 
define the probability of an observation departing from the 
hypothetical mean value 5.00 by 4.21 standard deviations 
or more, without reference to whether the departure be 
above or below the hypothetical value, we must double 
the above probability. The chance of such a departure in 
one or the other direction is 2/100,000. Tests of hypotheses 
usually take this latter form. It is customary to ask whether 
a deviation of a stated magnitude could occur, and to 
measure the probabilities involved with reference to deviations 
in both directions. 

In using tables of the normal probability integral in 
tests of this type we are generally concerned with the 
probability of occurrence of deviations as great as or greater 
than some stated value (in the above example, .91 years, 
or 4.21 standard deviations). This probability is represented 
by areas in the two tails of a normal curve (assuming 
that deviations either above or below the mean are in 
question). The inside limits of these segments are set 
by ordinates erected at distances from the mean equal 
to the deviation in question; the outside limits are at 
infinity. (See Fig. 85, in Chapter XIII, for a graphic 
representation of segments lying beyond stated limits.) The 
usual tables of the probability integral define the areas 
falling within limits set by ordinates at specific points. 
Our concern is with areas beyond these ordinates. Sub- 
traction of the internal area from the whole area (unity) 
will, of course, give the area of the external portion defining 
the probability that is here desired. 

If we should be testing the hypothesis that the mean 
duration of business cycles is four years, we derive the 
value of T as follows: 

$$T = \frac{4.09 - 4.00}{.216} = .42.$$

From the tabulated values of areas under the normal curve 
we determine that approximately 67 per cent of all the 
observations in a normal distribution will deviate from 
the mean value by .42 standard deviations or more. We 
interpret this to mean that if our sample of 77 observations 
were drawn from a universe with a mean value of 4.00 
years, the chances are 67 out of 100 that the mean of the 
sample would depart from the population mean by .09 years 
or more. (We have counted the combined probabilities 
of deviations above and below the population mean.) In 
other words, a deviation as great as the one we have experienced 
is highly probable. The results are not inconsistent 
with the hypothesis that the mean duration of business 
cycles is 4.00 years. They do not, be it noted, prove the 
hypothesis. All that we may say of statistical evidence, 
on the positive side, is that it is not inconsistent with a 
given hypothesis. Supporting statistical evidence strengthens 
our confidence in the hypothesis, of course. Its tenability 
must be determined on the basis of rational considerations 
as well as empirical evidence. 

¹ See W. Edwards Deming and Raymond T. Birge, “On the Statistical 
Theory of Errors,” Reviews of Modern Physics, Vol. 6, July, 1934, 133ff., for a 
discussion of this probability, which they designate $P_u$, and tests based on it. 
(In their terminology, u is the difference between the mean of the sample 
and the mean of the assumed population.) This article includes a chart (134) 
for use in determining the significance of a given deviation. 
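
The subtraction of the internal area from unity, described earlier for two-tailed tests, can be sketched as follows for the four-year hypothesis. The function name `two_tailed_probability` is illustrative; the figures 4.09, 4.00 and .216 are those of the example.

```python
from statistics import NormalDist

def two_tailed_probability(t):
    """Probability of a normal deviate at least as large as |t| in either direction."""
    inside = NormalDist().cdf(abs(t)) - NormalDist().cdf(-abs(t))  # area within +/- t
    return 1.0 - inside

t = (4.09 - 4.00) / 0.216
p = two_tailed_probability(round(t, 2))
print(round(t, 2), round(p, 2))  # 0.42 0.67
```

A probability of this size signals agreement between sample and hypothesis, not proof of the hypothesis.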

This last point deserves emphasis. “The significance of 
each test,” say Deming and Birge,¹ “depends not only on 
the value of P (i.e., the measure of probability appropriate 
to the test) that is found, but also on how much is known 
a priori regarding the parent population.” The above 
hypothesis of a four-year cycle has no particular rational 
basis (the figure was used here, of course, to exemplify 
a procedure). The fact that the observed results are not 
inconsistent with it is significant in a negative way, but 
does not establish the truth of the hypothesis. Low values 
of P, indicating that the facts are inconsistent with given 
hypotheses, are highly useful in leading us to reject tentative 
formulations of theory. Acceptable values of P, however, 
need the support of other knowledge (a priori and empirical) 
concerning the body of materials being studied and the 
regularities prevailing therein. Within the limit of acceptable 
values, indeed, we may accept one hypothesis, rather than 
another for which empirical tests yield a higher value of 
P, because the former is more consistent with the general 
body of existing knowledge concerning the field in question. 

¹ Loc. cit., 137. 

In the two tests we have applied, no difficulty was encountered 
in interpreting the probabilities bearing on the relation 
between the hypothetical mean and the observed facts. 
In the one case the odds were so small as to leave no doubt 
as to the lack of agreement; in the other case the difference 
was clearly insignificant. But many tests will lie on the 
borderline, and we must have some reasonable criterion 
as to the limit of significance. Odds of 1 out of 100 constitute 
one conventional standard. If a given difference between 
hypothetical and observed values would occur as a result 
of chance only 1 time out of 100, or less frequently, we may 
say that the difference is significant. This means that the 
results are not consistent with the hypothesis we have 
set up. If the discrepancy between theory and observation 
might occur more frequently than 1 time out of 100 solely 
because of the play of chance, we may say that the difference 
is not clearly significant. The results are not inconsistent 
with the hypothesis. The value of T (the difference between 
the hypothetical value and the observed mean, in units 
of the standard error of the mean) corresponding to a 
probability of 1/100 is 2.576. One hundredth part of the 
area under the normal curve lies at a distance from the 
mean, on the x-axis, of 2.576 standard deviations or more. 
Accordingly, tests of significance may be applied with direct 
reference to T, interpreted as a normal deviate (i.e., as a 
deviation from the mean of a normal distribution expressed 
in units of the standard deviation). A value for T of 2.576 
or more indicates a significant difference, while a value of 
less than 2.576 indicates that the results are not inconsistent 
with the hypothesis in question. 

There is, of course, nothing rigid about this particular 
standard. Some statistical workers employ odds of 1 out 
of 20 as a limit, rather than 1 out of 100. With this standard 
we would accept as significant (i.e., not due to chance) a 
difference between hypothetical and observed values that 
would occur only 5 times out of 100, or less frequently, as 
a result of random fluctuations of sampling. The value 
of T corresponding to this standard is 1.96. The standards 
of significance actually employed by a research worker may 
well vary from problem to problem. The investigator uses 
the results of these tests of significance as aids in the inter- 
pretation of his results and in the development of a body of 
theory that is not inconsistent with the evidence provided 
by experience. In the interplay of deduction and induction 
that marks such a process, no single absolute standard 
for the rejection or acceptance of hypotheses would be 
appropriate. 
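
The two conventional limits may be recovered from the normal distribution directly. The following Python sketch is illustrative; the helper names `critical_t` and `is_significant` are hypothetical, and the choice of odds remains a matter for the investigator, as the text emphasizes.

```python
from statistics import NormalDist

def critical_t(odds):
    """Normal deviate exceeded (in either direction) with probability `odds`."""
    return NormalDist().inv_cdf(1 - odds / 2)

print(round(critical_t(1 / 100), 3))  # 2.576
print(round(critical_t(1 / 20), 2))   # 1.96

def is_significant(t, odds=1 / 100):
    """True when |t| reaches the limit of significance for the chosen odds."""
    return abs(t) >= critical_t(odds)

# The two business-cycle tests: five-year and four-year hypotheses.
print(is_significant(-4.21), is_significant(0.42))  # True False
```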

The formula for the standard error of a mean, as given 
above, relates to a sample chosen by random selection. 
For a proportionately stratified sample the standard error 
of the mean, $\sigma_{M_s}$, may be derived from the relation 

$$\sigma_{M_s}^2 = \sigma_0^2 - \frac{\sigma_s^2}{N}$$

where $\sigma_0$ is the standard error of the same mean as it would 
have been had the N observations been taken at random 
from the universe of inquiry, and $\sigma_s$ is the standard deviation 
of the averages of the several strata about the average 
of the whole sample.¹ In computing $\sigma_s$ the deviation of 
the mean of each stratum is weighted in proportion to the 
number of cases in that stratum. N is the total number 
of observations in the sample. It is clear from the formula 
that the standard error of the mean of the stratified sample 
is smaller than the standard error of a corresponding random 
sample. 
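
A small numerical sketch may make the reduction concrete. It assumes the relation takes the form $\sigma_{M_s}^2 = \sigma_0^2 - \sigma_s^2/N$, with $\sigma_s$ weighted by stratum size as described; the function name and all figures below are hypothetical and serve only to illustrate the computation.

```python
import math

def stratified_se(sigma, strata_means, strata_sizes):
    """Standard error of the mean of a proportionately stratified sample.

    sigma        -- standard deviation of the whole sample
    strata_means -- mean of each stratum
    strata_sizes -- number of sampled cases in each stratum
    """
    n = sum(strata_sizes)
    overall = sum(m * k for m, k in zip(strata_means, strata_sizes)) / n
    # Variance of the stratum means about the overall mean, weighted by stratum size.
    var_s = sum(k * (m - overall) ** 2 for m, k in zip(strata_means, strata_sizes)) / n
    var_0 = sigma ** 2 / n  # squared standard error under random selection
    return math.sqrt(var_0 - var_s / n)

# Hypothetical figures, for illustration only.
se_random = 10.0 / math.sqrt(100)
se_strat = stratified_se(10.0, strata_means=[40.0, 52.0], strata_sizes=[50, 50])
print(se_random, round(se_strat, 3))  # 1.0 0.8
```

The more widely the stratum means differ, the greater the gain from stratification.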

¹ The above formula is from A. L. Bowley, Elements of Statistics, London, 
King, sixth ed., 1937. 

SAMPLING ERRORS: MEDIAN AND QUARTILES 

The median is subject to greater sampling fluctuations 
than is the mean. The degree of dispersion of median 
values derived from a number of samples of a stated size 
from a given population will be approximately 25 per cent 
greater than the dispersion of the arithmetic means of 
the same samples. More exactly, we have 

$$\sigma_{Md} = 1.25331 \frac{\sigma}{\sqrt{N}}.$$

Estimates of the quartiles, in turn, are less accurate than 
are estimates of the median. For these we have 

$$\sigma_{Q_1} = \sigma_{Q_3} = 1.36263 \frac{\sigma}{\sqrt{N}}.$$
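
The factor of about 1.25 for the median can be checked by experiment. The Python sketch below draws repeated samples from a normal universe and compares the dispersion of their medians with that of their means; the sample size, the number of trials, and the seed are arbitrary choices made for illustration, and the normality of the universe is an assumption of the sketch.

```python
import random
import statistics

rng = random.Random(0)
n, trials = 101, 2000

means, medians = [], []
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Ratio of the dispersion of the medians to the dispersion of the means.
ratio = statistics.pstdev(medians) / statistics.pstdev(means)
print(round(ratio, 2))  # expected to fall near 1.25
```

Since this is a sampling experiment, the ratio will itself fluctuate somewhat from run to run if the seed is changed.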


SAMPLING ERRORS: STANDARD DEVIATION 

In determining the magnitude of the sampling errors to 
which the standard deviation is subject we must distinguish 
between samples drawn from a normally distributed universe 
and those derived in the more general case, in which the nature 
of the distribution of the universe is unknown. If the 
distribution of the universe is normal we have, as the estimated 
standard error of $\sigma$, 

$$\sigma_\sigma = \frac{s}{\sqrt{2N}}$$

(where N - 1 has been used in the computation of s). Thus, 
for the universe of residential telephone subscribers represented 
by the distribution in Table 109, we have 

$$\sigma_\sigma = \frac{147.7}{\sqrt{1{,}990}} = 3.31.$$


The more general formula for the standard error of the 
standard deviation involves the fourth as well as the second 
moment of the distribution: 

$$\sigma_\sigma = \sqrt{\frac{\mu_4 - \mu_2^2}{4\mu_2 N}}.$$

For the distribution based on hourly earnings in open-hearth 
steel furnaces in 1933 the standard deviation was 18.685 
cents (see Table 34). As the standard error of this measurement 
we have¹ 

$$\sigma_\sigma = \sqrt{\frac{1{,}384.1183 - (13.9674)^2}{4 \times 13.9674 \times 11{,}404}} = .0432.$$

Since the moments here employed are in class-interval 
units, the derived measurement is also in those terms. In 
the original units we have 

$$\sigma_\sigma = .0432 \times 5 \text{ cents} = .2160 \text{ cents}.$$


Many tests of significance involve the use of standard 
deviations and corresponding measurements of sampling 
reliability. These are discussed more fully in the chapter 
on the analysis of variance. 
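
Both forms of the standard error of the standard deviation can be verified with a brief Python sketch. The function names are illustrative. For the telephone example the text gives only the divisor $\sqrt{1{,}990}$, so the sample size of 995 used below is an inference from 2N = 1,990 and should be treated as an assumption.

```python
import math

def se_sd_normal(s, n):
    """Standard error of the standard deviation when the universe is normal."""
    return s / math.sqrt(2 * n)

def se_sd_general(mu2, mu4, n):
    """General form, from the second and fourth moments about the mean."""
    return math.sqrt((mu4 - mu2 ** 2) / (4 * mu2 * n))

# Telephone example: 147.7 / sqrt(1,990), taking 2N = 1,990 (N assumed to be 995).
print(round(se_sd_normal(147.7, 995), 2))          # 3.31

# Earnings example, moments in class-interval units (interval = 5 cents).
se_units = se_sd_general(13.9674, 1_384.1183, 11_404)
print(round(se_units, 4), round(se_units * 5, 3))  # 0.0432 0.216
```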


¹ Since Sheppard's corrections are not appropriate to this distribution, the 
uncorrected moments are used. 

SAMPLING ERRORS: COEFFICIENT OF CORRELATION 

A number of distinctive problems are faced in generalizing 
the results of correlation studies and in determining the 
significance of the measurements secured in such studies. 
Certain of these problems are discussed in the succeeding 
chapter, and Chapter XVIII deals with important limitations 
that are faced when the samples employed are small. 
At this point general methods of measuring the reliability 
of correlation measurements are presented, without certain 
of the qualifications that will be discussed later. 

As a basic formula for the sampling error of the coefficient 
of correlation computed from N pairs of observations, we 
have 

$$\sigma_r = \frac{1 - r^2}{\sqrt{N - 1}}$$

where r, in the numerator of the right-hand member, is 
the true coefficient of correlation in the population at large. 
Since we do not know the true r we must use the r of the 
sample as an estimate of the required value. This formula 
may be taken to hold for distributions approaching the 
normal type, when the number of cases included in the 
sample is fairly large, say 50 or more. When the sample 
is small and, particularly, when we are dealing with a 
relatively high coefficient of correlation derived from a 
small sample, the standard error secured from the formula 
cited above may be faulty, and tests of significance based 
on it misleading. The reason for this and means of meeting 
the difficulty are discussed in Chapter XVIII. 

In exemplifying the application of the usual test, we 
may employ results presented in Chapter X, on the relation 
between the discount rates of Federal Reserve banks and 
of commercial banks. The value of r is + .84, while N 
equals 1,800. Accordingly, we have 


$$\sigma_r = \frac{1 - (.84)^2}{\sqrt{1{,}800 - 1}} = \frac{.2944}{42.40} \approx .007.$$
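
The computation may be sketched in Python as follows, taking the standard error of r as $(1 - r^2)/\sqrt{N - 1}$. The function name `se_correlation` is illustrative; the figures are those of the discount-rate example.

```python
import math

def se_correlation(r, n):
    """Standard error of r from N pairs: (1 - r^2) / sqrt(N - 1)."""
    return (1 - r ** 2) / math.sqrt(n - 1)

print(round(1 - 0.84 ** 2, 4))                # 0.2944
print(round(se_correlation(0.84, 1_800), 4))  # 0.0069
```

The same function serves for any sample r, subject to the cautions noted above regarding small samples and high coefficients.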


The standard error of r is frequently used, as are similar 
measurements relating to other statistical constants, to test 
hypotheses. We may put such a question as the following: 
Is the value of r secured from a given sample significant 
of a real relationship between the variables in question 
in the population from which the sample was drawn? 
Putting the question in a form more appropriate for testing: 
Is the present value of r consistent with the hypothesis 
that there is no relationship between the variables in 
question in the population at large? R. A. Fisher terms 
such an hypothesis a “null hypothesis.” The purpose of 
experiment, in his words, is to give the facts a chance of 
disproving the null hypothesis. 

In a study of the movements of commodity prices, 1,202 
measurements were made of the timing of advances in 
the prices of individual commodities during periods of 
general business revival. Paired with each measurement 
was a similar observation on the timing of the decline in 
the price of the given commodity during the succeeding 
period of general business recession.¹ We desire to know 
whether there is any relation between the sequence of price 
revival and the sequence of price recession. Is there a 
pattern in price movements during business cycles? Evidence 
of the existence of such a persistent pattern would lend 
support to the view that cycles represent true regularities 
in economic life. 

These 1,202 pairs of observations yield a correlation 
coefficient of + .27. This does not show a pronounced 
degree of relationship. Our chief concern, however, is not 
with the magnitude of r. We wish to know whether the 
result is consistent with the hypothesis that the true correlation 
is zero. For the standard error of r we have 

$$\sigma_r = \frac{1}{\sqrt{1{,}202 - 1}} = .029.$$


By hypothesis, the population value of r is zero, so the 
numerator of the fraction is 1. 

If the true value of r were zero, and the standard error 
of r were .029, what would the probability be that, as a result 
of chance, we should secure a coefficient of + .27 from 
a given sample? Since this value represents a departure 
of more than 9 σ's from the hypothetical value of zero, 
the probability that the difference is due to chance is 
infinitely small. We conclude that the results are not 
consistent with the hypothesis that the sequence of price 
change during revival is unrelated to the sequence of decline 
in a succeeding recession. The null hypothesis is disproved. 
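
The test of the null hypothesis can be retraced in a few lines of Python. The variable names are illustrative; the figures (r = .27, N = 1,202, limit 2.576) are those of the example.

```python
import math

r, n = 0.27, 1_202

# Under the null hypothesis the population r is zero, so the numerator is 1.
se_null = 1 / math.sqrt(n - 1)
t = (r - 0) / se_null

print(round(se_null, 3), round(t, 1))  # 0.029 9.4
print(t >= 2.576)                      # True
```

A deviate of this size lies far beyond either conventional limit of significance.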

Had the value of T (in this case $T = \frac{r - 0}{\sigma_r}$) been less 
than 2.576 the conclusion would of course have been different. 
In such a case the discrepancy between the sample r 
and the hypothetical value of zero could be attributed to 
sampling fluctuations. The result would not be inconsistent 
with the null hypothesis. 

¹ The Behavior of Prices, New York, National Bureau of Economic Research, 
1927, 131. 

Having established that the results are not consistent 
with the hypothesis that the true value of r is zero, we may 
compute the standard error of r as actually derived. Assuming 
now that the sample is drawn from a parent population 
in which r = + .27, we have 

$$\sigma_r = \frac{1 - (.27)^2}{\sqrt{1{,}202 - 1}} = .027.$$

SAMPLING ERRORS: INDEX OF CORRELATION 

The standard error of the index of correlation may be 
approximated from the relation 

$$\sigma_\rho = \frac{1 - \rho^2}{\sqrt{N - m}}.$$

In this formula m represents the number of constants in 
the equation of regression. In the example cited in Chapter 
XII, relating to alfalfa yield and depth of irrigation water, 
$\rho$ is .80, N is 44, and m has a value of 3. We have, thus, 

$$\sigma_\rho = \frac{1 - (.80)^2}{\sqrt{44 - 3}} = .056.$$


The use and interpretation of this measure are analogous 
to those of $\sigma_r$. In the present instance the index of correlation 
is clearly significant.¹ 

¹ See Ezekiel, M., Methods of Correlation Analysis, N. Y., John Wiley and 
Sons, 1930, 257-258, for a discussion of the sampling reliability of the index of 
correlation. 

SAMPLING ERRORS: THE TEST FOR LINEARITY 

As a test for linearity we have been given 

$$\zeta = \eta^2 - r^2.$$

But we wish to know whether, in a given case, the difference 
between $\eta^2$ and $r^2$ may be due merely to a chance fluctuation 
of sampling, or to a real departure of the underlying relationship 
from the linear form. As the standard error of $\zeta$ 
Blakeman has proposed 

$$\sigma_\zeta = 2\sqrt{\frac{\zeta}{N}}\,\sqrt{(1 - \eta^2)^2 - (1 - r^2)^2 + 1}.$$
The use of this measure may be illustrated with reference
to the problem relating to wheat yield which was considered
in an earlier chapter. For the relation between wheat
yield and amount of nitrogen used as fertilizer, we had

$$r = +.793, \qquad \eta = .966, \qquad N = 193.$$

(The uncorrected value of $\eta$ should be used here.)

Therefore

$$\zeta = \eta^2 - r^2 = .302.$$

Inserting the given values in the formula for $\sigma_\zeta$ and solving,
we have

$$\sigma_\zeta = .074.$$

With $\zeta$ having a value of .302, about 4.08 times its standard
error, there can be no question as to the non-linearity of
the relationship. The difference between $\eta^2$ and $r^2$ is one
which could hardly be due to chance fluctuations of sampling.
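The arithmetic of the Blakeman test may be sketched as follows. This is an illustration only; the function name `se_zeta` is an assumption. The value of ζ is entered as the .302 quoted in the text rather than recomputed, since the rounded values η = .966 and r = .793 give about .304.

```python
import math

def se_zeta(zeta, eta, r, n):
    """Blakeman's standard error of zeta = eta^2 - r^2."""
    return 2 * math.sqrt(zeta / n) * math.sqrt(
        (1 - eta ** 2) ** 2 - (1 - r ** 2) ** 2 + 1
    )

zeta = 0.302                         # value quoted in the text
se = se_zeta(zeta, 0.966, 0.793, 193)
se_rounded = round(se, 3)            # the text works with the rounded figure
ratio = zeta / se_rounded            # zeta expressed in units of its standard error

print(se_rounded, round(ratio, 2))
```

Dividing by the unrounded standard error gives a ratio of about 4.10 rather than 4.08; the conclusion is unaffected.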

The criterion $\eta^2 - r^2$ is not very satisfactory as a test
of linearity, since the distribution of $\zeta$ does not follow the
normal law. The same weakness attaches to the correlation
ratio. As Fisher has demonstrated, the distribution of
$\eta$ does not tend to normality, even with large samples,
unless the number of arrays is increased without limit.
Accordingly, the standard error of $\eta$ is of dubious utility.
More efficient methods of testing for the existence of
correlation, and for linearity, are discussed in Chapter XV.





SAMPLING ERRORS: COEFFICIENT OF RANK CORRELATION

The standard error of the coefficient of rank correlation
has been given by “Student” as

$$\sigma_{\rho_r} = \frac{1}{\sqrt{N - 1}}.$$

It is notable that this value is independent of the true
value of $\rho_r$.¹ This standard error may be taken to relate
to a normal distribution, and interpreted in the familiar
manner, when N is fairly large, say 45-50 or more. For
small samples the distribution of $\rho_r$ is not normal. In the
example cited in Chapter X, dealing with the relation
between the number of individual income tax returns and
the number of passenger automobiles registered in 1934, by
states, we had $\rho_r = .94$. Since there are 47 observations,
the value of $\sigma_{\rho_r}$ is given by

$$\sigma_{\rho_r} = \frac{1}{\sqrt{47 - 1}} = .147.$$

The sample is large enough to justify the assumption that
the distribution of $\rho_r$ would approximate the normal type.
The coefficient of rank correlation is clearly significant,
being more than six times its standard error.
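A short numerical check of this example follows. It is a sketch resting on one assumption: that the standard error takes the form 1/√(N - 1), which agrees with the statement that it does not depend on the true value of the coefficient and with the ratio of more than six. The name `se_rank` is illustrative.

```python
import math

def se_rank(n):
    """Standard error of the rank correlation coefficient, assumed to be 1 / sqrt(N - 1)."""
    return 1 / math.sqrt(n - 1)

se = se_rank(47)     # 47 states
ratio = 0.94 / se    # observed coefficient in units of its standard error

print(round(se, 3), round(ratio, 1))
```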

SAMPLING ERRORS: COEFFICIENT OF REGRESSION

High importance frequently attaches to the coefficient
of regression, in dealing with relationships among variable
quantities. For the standard error of this measurement we
have²

$$\sigma_b = \frac{s_y}{\sqrt{\Sigma x^2}}$$

where x is a given value of the independent variable,
expressed as a deviation from the mean of that variable,
and $s_y$ is the root mean square of the deviations of the
actual values of y, the dependent variable, from the
corresponding computed values. That is, $s_y$ is a measure of the
scatter about the line of regression.¹

¹ See Hotelling and Pabst, loc. cit.

² See R. A. Fisher, Statistical Methods for Research Workers, Edinburgh,
Oliver and Boyd, sixth edition, 1936, 134-146.

A test involving the use of $\sigma_b$ may be applied to data
relating to the average corn yield per acre in Kansas, by
years, from 1890 to 1933 (see Table 128, Chapter XVI).
These yields show a fairly consistent declining trend. A
line of trend fitted to the figures for these 44 years is defined
by the equation

$$Y = 22.05 - .1074X$$

where Y denotes corn yield per acre and X denotes time,
in years, with origin at 1889. We wish to know whether
the coefficient of regression (i.e., the slope of the line of
trend) represents a significant departure from zero. The
hypothesis we are testing is, then, that the true value of
the coefficient of regression, in the population from which
this sample is drawn, is zero; that is, that there has been no
significant decline in corn yield in Kansas over the period
in question.²

¹ $s_y = \sqrt{\Sigma (y - y_c)^2 / N}$,
where y denotes a given value of the dependent variable and $y_c$ denotes the
corresponding value derived from the equation of regression. In the
computation of $s_y$ for this purpose N must be reduced by the number of constants in
the equation of regression.

² The hypothetical population of which we assume our sample to be
representative is the population that would be generated by the forces responsible
for variation in Kansas corn yields from 1890 to 1933, if those forces,
unchanged, were to act upon an infinite number of cases. The application of
this concept, and of the whole probability calculus, to data ordered in time
involves some logical difficulties, which are discussed at a later point.

For $s_y$ we secure the value 6.70, for $\sqrt{\Sigma x^2}$ the value
84.2. Accordingly

$$\sigma_b = \frac{6.70}{84.2} = .0795.$$

We may denote by the symbol $\beta$ the coefficient of regression
assumed in our hypothesis (in this case zero). We wish to
know whether the deviation of our actual b from this
hypothetical $\beta$ may be attributed to chance, or whether
it is too great to be so explained. This deviation should
be expressed in units of the standard error of b, in order
that the probabilities underlying the normal distribution
may be applied in our reasoning. Using T, as before, to
denote the deviation in units of $\sigma$, we have

$$T = \frac{b - \beta}{\sigma_b} = \frac{-.1074 - 0}{.0795} = -1.35.$$

The given value of b represents a departure of 1.35
standard deviations from the mean value of zero in our
hypothetical population. As may readily be determined by
reference to the table of the probability integral, such a
deviation might easily occur, as a result of chance alone.
The results, then, are not inconsistent with the hypothesis.
There is no clear evidence here of a significant decline in
corn yield per acre in Kansas during the period covered.
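The test of the Kansas trend may be retraced in a few lines. The sketch below is illustrative: it assumes the standard error of b is s_y divided by the square root of the sum of squared deviations of X from its mean, it rebuilds that sum from the 44 consecutive years, and it uses the complementary error function in place of the printed table of the probability integral. All names are hypothetical.

```python
import math

def sum_sq_dev_time(n_years):
    """Sum of squared deviations from the mean for X = 1, 2, ..., n_years."""
    xs = range(1, n_years + 1)
    mean = sum(xs) / n_years
    return sum((x - mean) ** 2 for x in xs)

def two_tailed_normal_p(t):
    """Probability of a normal deviate as large as |t| or larger, in either direction."""
    return math.erfc(abs(t) / math.sqrt(2))

s_y = 6.70                                     # scatter about the line of trend (from the text)
root_sum_x2 = math.sqrt(sum_sq_dev_time(44))   # 1890-1933, origin at 1889
se_b = s_y / root_sum_x2
t = (-0.1074 - 0) / se_b                       # observed slope against a hypothetical slope of zero
p = two_tailed_normal_p(t)

print(round(root_sum_x2, 1), round(se_b, 4), round(t, 2), round(p, 3))
```

A deviation of this size or larger would be expected in roughly 18 samples out of 100 on the normal assumption, which is why the hypothesis of no decline is not rejected.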

SAMPLING ERRORS: DIFFERENCE BETWEEN MEANS

A problem of sampling that arises rather frequently is
that of determining whether two samples could have been
drawn from the same parent population. Obviously, there
would be some difference between the means of two samples
from the same universe, as there would be between standard
deviations or coefficients of correlation secured from different
sampling operations. We may illustrate the procedure
employed in determining the significance of a difference
between two arithmetic means.

Reference has been made above to a sample of 77 business
cycles, occurring during stages of rapid industrialization.
Their mean duration was 4.09 years; the standard deviation
of the distribution was 1.88 years. The same investigation
indicated that the mean duration of 51 business cycles
occurring in various economies during early stages of
industrialization was 5.86 years, and that the standard
deviation of these measurements was 2.41 years. There is
an indication here that business cycles are accelerated,
that their average length is shortened, when an economy
is passing through a phase of rapid industrialization with
corresponding impetus to technological change. In this
case the null hypothesis against which we set our facts
is that there is no difference, in respect of duration, between
business cycles occurring in the two stages of industrialization
named.

The difference between two means is a statistical
measurement subject to a definite law of distribution. If a
great many pairs of samples were drawn from a given
population, the value D (i.e., $M_1 - M_2$) could be computed
from the two means of each pair. A frequency distribution
of the D's thus secured would follow the normal law. The
magnitude of the standard deviation of this distribution
would be a function of the sizes of the samples thus paired
and of the standard deviations of these samples. We may
approximate the standard deviation of this distribution of
D's from the relation

$$\sigma_D = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2}$$

or from

$$\sigma_D = \sqrt{\frac{\sigma_1^2}{N_1 - 1} + \frac{\sigma_2^2}{N_2 - 1}}.$$

The measurement needed for testing the hypothesis now
before us is computed from the relation

$$\sigma_D = \sqrt{\frac{(1.88)^2}{76} + \frac{(2.41)^2}{50}} = \sqrt{.1627} = .4034.$$

The value of D, the difference between the two means, is
5.86 - 4.09, or 1.77. This value of D is to be judged
with reference to a hypothetical value of zero. Accordingly,
for T (the discrepancy expressed as a normal deviate) we
have

$$T = \frac{1.77 - 0}{.4034} = 4.39.$$

This discrepancy far exceeds the magnitude 2.576,
corresponding to odds of 1 out of 100. If the true value of D
were zero, a discrepancy as great as this or greater would
occur as a result of chance about 1 time out of 100,000
trials. The results indicate that the difference between
the two means is not due to chance. The facts are not
consistent with the hypothesis that the two samples are
drawn from the same population. There is a significant
difference between the average durations of business cycles
occurring in early stages of industrialization and in later
stages of rapid industrial change.
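The comparison of the two mean durations may be sketched as follows, assuming the standard error of the difference is the square root of the sum of the two squared standard deviations, each divided by N - 1. Function and variable names are illustrative, and the complementary error function stands in for the printed normal table.

```python
import math

def se_diff_means(sd1, n1, sd2, n2):
    """Standard error of the difference between two means (divisors N - 1)."""
    return math.sqrt(sd1 ** 2 / (n1 - 1) + sd2 ** 2 / (n2 - 1))

# 77 cycles in stages of rapid industrialization, 51 cycles in early stages
se_d = se_diff_means(1.88, 77, 2.41, 51)
d = 5.86 - 4.09
t = (d - 0) / se_d                       # discrepancy as a normal deviate
p = math.erfc(abs(t) / math.sqrt(2))     # two-tailed probability under the normal law

# se_d comes to about .403; the last figure of the text's .4034 presumably
# reflects rounding of intermediate terms.
print(round(se_d, 3), round(t, 2), p)
```

The two-tailed probability comes to roughly 1 in 100,000, in agreement with the statement in the text.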


SAMPLING ERRORS: DIFFERENCE BETWEEN PERCENTAGES

There are occasions when it is desirable to determine
whether a difference between two proportions (or percentages)
is significant. Using $D_p$ to denote such a difference,
we have

$$\sigma_{D_p} = \sqrt{p_0 q_0 \left(\frac{1}{N_1} + \frac{1}{N_2}\right)}$$

where $p_0$ is the weighted mean proportion, $q_0$ is $1 - p_0$, and
$N_1$ and $N_2$ are the total numbers of cases in the two samples
to which the proportions relate.¹ (In computing this value
and applying the corresponding test it is necessary to divide
percentages by 100, to reduce them to the form of proportions
or ratios.)

A tabulation of American and foreign business cycles
by Wesley C. Mitchell has indicated a relative preponderance
of three-year cycles in American experience. Of 32
American cycles 10, or 31.2 per cent, lasted 3 years; of
134 cycles in other countries 20, or 14.9 per cent, lasted
3 years.² Is the difference between these two percentages
great enough to justify the inference that the forces acting
upon American business differ from those acting abroad,
creating a significantly higher percentage of three-year
cycles? The hypothesis that we test in this case is that the
difference is not significant, that the groups of American
and foreign business cycles are drawn from the same
universe.

¹ See Hornell Hart, “Reliability of a Percentage,” Journal of the
American Statistical Association, Vol. 21, March, 1926.

The two proportions, $p_1$ and $p_2$, with which we work are
.312 and .149. The difference $D_p$ between the two
proportions is .312 - .149, or .163. For the weighted mean
proportion we have

$$p_0 = \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2} = \frac{(32 \times .312) + (134 \times .149)}{32 + 134} = .1804$$

$$q_0 = 1 - p_0 = .8196.$$

We compute the standard error of $D_p$ from the relationship
shown above:

$$\sigma_{D_p}^2 = .1804 \times .8196 \left(\frac{1}{32} + \frac{1}{134}\right) = .005724$$

$$\sigma_{D_p} = .0757.$$

Between the given value of $D_p$ and the hypothetical value
of zero we have the discrepancy (expressed as a normal
deviate)

$$T = \frac{.163 - 0}{.0757} = 2.15.$$

A discrepancy as great as this or greater might occur,
as a result of chance, about 3 times out of 100. If our
standard of significance is 1 out of 100 we must conclude that
the difference between the two percentages is not clearly
significant. The result is not inconsistent with the hypothesis
we set out to test, namely that American and foreign business
cycles are drawn from the same universe, in respect of
the proportion of three-year cycles occurring. It is proper
to say, however, that we are dealing with border line results.
If our standard of significance were 1 out of 20 we should
consider the difference between American and foreign
experience significant. Perhaps we should say that although
the present evidence does not provide conclusive proof that
the two samples come from different universes, there is
indication of a difference between the forces affecting the
relative frequency of three-year cycles in the United States
and in foreign countries. Such results call for further
research, in order that a more definite conclusion may be
reached.

² See Business Cycles, the Problem and Its Setting, N. Y., National Bureau of
Economic Research, 1927, 399-400.
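The test of the difference between the two proportions may be reproduced with the sketch below. It assumes the pooled form of the standard error, the square root of p0·q0·(1/N1 + 1/N2), and the names used are illustrative only.

```python
import math

def se_diff_proportions(p1, n1, p2, n2):
    """Return (standard error of p1 - p2, weighted mean proportion p0)."""
    p0 = (n1 * p1 + n2 * p2) / (n1 + n2)   # weighted mean proportion
    q0 = 1 - p0
    return math.sqrt(p0 * q0 * (1 / n1 + 1 / n2)), p0

# 32 American cycles (31.2 per cent lasting 3 years), 134 foreign cycles (14.9 per cent)
se_dp, p0 = se_diff_proportions(0.312, 32, 0.149, 134)
t = (0.312 - 0.149 - 0) / se_dp
p = math.erfc(abs(t) / math.sqrt(2))       # two-tailed probability under the normal law

print(round(p0, 4), round(se_dp, 4), round(t, 2), round(p, 3))
```

The probability of about .03 lies between the two standards mentioned, 1 out of 100 and 1 out of 20, which is what makes the result a border line one.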

SAMPLING ERRORS AND SIGNIFICANT FIGURES

In deciding upon the number of figures to be recorded
as significant, measures of sampling errors are, of course,
pertinent. A useful general rule laid down by Truman L.
Kelley follows: In a final published constant, retain no figures
beyond the position of the first significant figure in one third
of the standard error; keep two more places in all
computations.¹ Its application may be illustrated with reference to
the figures on hourly earnings of 11,404 steel workers in
1933. The mean, to four places, is 50.1360 cents. The
standard error of the mean is .175 cents. One third of this
is .0583. The first significant figure is in the column of
hundredths. By the rule, therefore, the arithmetic mean
should be given as 50.14 cents. Two more places, or four
decimal places in all, should be retained in calculations.
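Kelley's rule lends itself to a mechanical statement. The following sketch is one possible rendering, not a procedure given in the text: the function names are hypothetical, the standard error is assumed to be positive, and the mean is taken as 50.1360 cents.

```python
import math

def kelley_decimals(standard_error):
    """Decimal places to publish: the position of the first significant
    figure in one third of the standard error (standard_error > 0)."""
    third = standard_error / 3
    return -math.floor(math.log10(third))

def kelley_round(value, standard_error):
    """Round a constant for publication according to the rule."""
    return round(value, kelley_decimals(standard_error))

places = kelley_decimals(0.175)          # one third of .175 is .0583, so hundredths
published = kelley_round(50.1360, 0.175)
working_places = places + 2              # two more places kept in computations

print(places, published, working_places)
```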

¹ The rule here given is the suggestion as re-phrased by P. J. Rulon
(Science, N. S. Vol. 84, No. 2,187, Nov. 27, 1936, 484). I have changed “one
half the probable error” in Rulon's statement to “one third of the standard
error.”

Some Limitations to Measures of Sampling Errors

The importance of such measures of reliability as have 
been discussed above is, of course, great. With their aid 
we may give precision to our judgments concerning the 
margins of error involved in extending statistical results
beyond the limits of actual observation. Yet limitations 
attach to them, and these must not be forgotten in a purely 
mechanical application of statistical tests. 

Reference has been made to limitations relating to the 
size of samples. In the interpretation of most measures of
sampling errors the assumption is made that statistical 
measurements secured from successive samples are dis- 
tributed in accordance with the normal law of error. When 
the number of cases is large this is approximately true, even 
though the original data are not so distributed. But with 
a small number of cases in each sample this assumption 
may be quite invalid. The significance of given deviations 
(in terms of T) is therefore materially altered when we are 
dealing with results secured from small samples. Techniques 
have been developed, however, for defining sampling errors 
based on small samples. These are discussed at a later 
point (Chapter XVIII). 

Moreover, the conventional standard errors we have 
discussed can be assumed to measure only errors arising
from the fluctuations of simple sampling. If there is to be 
full conformity to the conditions of simple sampling, the 
probability of a given event occurring must be the same 
in all parts of the universe being sampled and for all time 
periods included, and the individual events (i.e., drawings 
or observations) must be completely independent of one 
another. The fact that customary error formulas are 
strictly applicable only when these conditions have been met 
injects elements of doubt into many statistical inductions 
in the field of economics. We cannot always be sure that
the conditions of simple sampling are actually fulfilled.
They are rarely perfectly fulfilled in the handling of economic
data. The standard errors derived above can give no
indication of the possibility of fluctuations in successive
samples due to causes other than those arising from simple
sampling. Fluctuations due to bias, due to lack of
representativeness in the sample, due to persistent errors of
any sort, quite elude this method of determining probable 
stability. Although some degree of departure from the 
rigid conditions of perfect sampling does not deprive the 
measures of reliability of all value, the limitations noted 
must be the constant concern of the statistician. 

The element of time adds one serious difficulty to the 
problem of statistical induction in the realm of economics, 
and in the social sciences generally. A universe that extends
over time is subject to elements that are not
present among data relating to a cross-section of time.
Conditions of pig iron production, of banking, of foreign
trade, of income distribution change from year to year, 
even from month to month. We may hardly assume that 
data relating to different time periods reflect the play of 
identical forces. When we deal with data from different 
periods we are, as Oskar Anderson has pointed out, drawing 
from different universes. The structural changes that occur 
in economic organization are manifestations of this state
of never-ending transition. Accordingly the homogeneity
of all populations extending over time is suspect. In par- 
ticular are hazards faced when an induction extends to a 
time period not covered by the data of observation. 

The fitting of trend lines, and the use of deviations from 
trend in statistical analysis, represent one effort to overcome
difficulties arising out of temporal change. It is assumed
that variations due to trend reflect the deep-seated changes
that would introduce elements of heterogeneity into the
particular universe of inquiry, and that deviations from
trend may be made the bases of statistical inference. The
effects of some temporal changes are doubtless removed by
this process. But the argument cannot justify the extension
to a new time period of measures of sampling error based
on the study of another period, unless it can be established 
that no essential change occurred in the conditions affecting 
the phenomena in question. The probable errors involved 
in such extension, without the validation noted, are not 
capable of definition. For this extension would involve 
generalizing about one universe from the study of an- 
other. 

In the application of statistical methods proper choice 
of objectives, wise planning, and effective field work are 
of at least equal importance with skill in the use of statistical 
techniques. This is especially true as regards problems 
of sampling. Here chief emphasis falls on soundness and 
accuracy in the field work. The problems of field work are 
specialized and particular, arising out of specific problems 
and conditions. Appropriate special knowledge is needed 
for the selection and validation of the sample. 

Much may be done to strengthen a statistical induction 
by making actual statistical tests of the homogeneity of 
the population and of the stability of sampling results. 
By the study of successive samples the representativeness 
of statistical measures may be determined; and by testing 
the subordinate elements of a given sample, when broken 
up into significant sub-groups, the inherent stability of a 
sample may be checked. The uniformity of nature in a
given field is assumed in every induction. The induction is 
strengthened by every piece of evidence that supports the 
assumption.

CHAPTER XV


THE ANALYSIS OF VARIANCE 

The determination of degree of correlation between
variables involves, essentially, the comparison of measurements
of variability. Thus, in the familiar equation

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

we are comparing the dispersion about the fitted line of
regression ($S_y^2$) with the dispersion about the mean of the
y's ($\sigma_y^2$). Again, if we work with the relation

$$r^2 = \frac{\sigma_{y_c}^2}{\sigma_y^2}$$

we are comparing the dispersion of the computed values
of y about the mean of the y's ($\sigma_{y_c}^2$) with the dispersion
of the original observations about the mean of the y's ($\sigma_y^2$).
It is logical thus to compare measurements of variation
in applying correlation technique, for the purpose of the
investigator is usually to test an hypothesis concerning the
forces responsible for variation in the dependent variable.
He is usually seeking an associated factor which may, on
some rational basis, be assumed to influence the fluctuations
of the variable he is treating as dependent. R. A. Fisher
has developed a procedure to employ in the study of
correlation which is based explicitly upon the analysis of variance.
We deal in this chapter with certain applications of the
flexible and powerful instrument Fisher has forged.

Comparison of Measures of Variability

We deal first with a simple comparison of two groups,
in respect of variability. The prices of preferred and common
stocks, as quoted on the New York exchanges, may
be compared, to determine whether they differ significantly
in variability. Table 29, presented on a preceding page,
showed the distribution of closing prices on July 25, 1936,
of 66 preferred stocks, paying annual dividends of seven
per cent. With this we may compare the distribution of a
like number of common stocks selected at random from
those for which prices were quoted on the New York Stock
Exchange on July 25, 1936. The required values are given
in Table 111.


                               Table 111

 Comparison of Preferred and Common Stocks in Respect of Price Variation

                    Degrees   Sum of squares  Mean square  Standard   Common log.  Natural log.
                    of        of deviations   deviation    deviation  of standard  of standard
                    freedom   from mean       (variance)              deviation    deviation
                    (n)                       σ²           σ          log10 σ      loge σ

Common stocks         65       99,327.28      1,528.112     39.09      1.59207      3.66590
Preferred stocks
  (seven per cent)    65       30,812.20        474.034     21.77      1.33786      3.08056

                                                               Difference = z =     0.58534


Each distribution includes 66 observations. (It is not
essential to this comparison that the number of observations
in the two distributions be equal.) In computing the mean
square deviation we divide the sum of the squared deviations
from the mean by n, the number of degrees of freedom,
which is here equal to one less than the total number of
observations in each distribution, that is, to N - 1. (More
is said below about the determination of number of degrees
of freedom.) The standard deviation of the common stocks,
39.09, is materially greater than the corresponding figure,
21.77, for preferred stocks, but we cannot tell by inspection
whether the difference is significant, or whether it merely
reflects a fluctuation of sampling. A precise test may be
made by using the coefficient z as a measure of the difference
in variability.

This coefficient is equal to the difference between the
natural logarithms of the two standard deviations. That
is,

$$z = \log_e \sigma_1 - \log_e \sigma_2.$$

It is to be noted that natural logarithms are to be employed.
Common logarithms on the base 10 may be shifted readily
to natural logarithms on the base e (2.71828) by using the
factor 2.3026 as a multiplier. From the entries in the
last column of Table 111 we derive .58534 as the value
of z.

If common and preferred stocks were alike, with respect
to the dispersion of their prices, and if we were dealing with
large samples, so that sampling fluctuations did not affect
the measures of variance, the value of z would be zero.
Is the value we have derived consistent with the hypothesis
that the true value of z is zero? Could sampling fluctuations
alone account for a deviation as great as .58534 from a
true value of zero? If the derived value of z is too great
to be attributed to sampling fluctuations, the hypothesis
that common and preferred stocks are alike, with respect
to the dispersion of their prices, is untenable.

To determine whether the derived value of z is consistent
with the hypothesis that its true value is zero, we must
know something about the distribution of values of z, if
these were computed from many samples drawn under
the same conditions. Fisher has shown that this distribution
is normal, or effectively so, when the two distributions
being compared both include a large number of observations.
This is also true when the two distributions include only
a moderate number of observations, but with $n_1$ and $n_2$
equal or nearly equal. The standard deviation of a
distribution of z's secured under these conditions, or the
standard error of z, is a function of the two n's. It may be
derived from the relationship

$$\sigma_z = \sqrt{\frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where $n_1$ and $n_2$ are the number of degrees of freedom in
the two distributions.

In the present example $n_1$ and $n_2$ are both equal to 65;
the standard error of z is equal to the square root of the
reciprocal of 65. We have

$$\sigma_z = \sqrt{.01538} = .124.$$

The test of the hypothesis that the true value of z is zero
reduces, then, to the question whether a value of .58534
is likely to be drawn from a normally distributed population
with a mean value of zero and a standard deviation of
.124. A value of .58534 represents a deviation of 4.72
standard deviations from zero (i.e., $z/\sigma_z = 4.72$). A deviation
as great as this occurs so seldom, in random sampling,
that we may not accept the conclusion that the present 
value represents a chance deviation from zero. The result 
is not consistent with the hypothesis that the true value 
of z is zero. The dispersion of common stock prices is 
significantly greater than the dispersion of the prices of 
preferred stocks paying seven per cent dividends. 

To exemplify a different condition, we may compare the
dispersion of prices of preferred stocks paying six per cent
and of preferred stocks paying seven per cent dividends.
We have 64 quotations on the former, 66 quotations on
the latter, both relating to closing prices on the New York
Stock and Curb Exchanges on July 25, 1936. The figures
are given in Table 112 on page 494.

                               Table 112

      Comparison of Six Per Cent and Seven Per Cent Preferred Stocks
                     in Respect of Price Variation

                    Degrees   Sum of squares  Mean square  Standard   Natural log.
                    of        of deviations   deviation    deviation  of standard
                    freedom   from mean       (variance)              deviation
                    (n)                       σ²           σ          loge σ        1/n

Seven per cent
  preferred stocks    65       30,812.2        474.034      21.77      3.08056     .0153846
Six per cent
  preferred stocks    63       28,175.0        447.222      21.15      3.05166     .0158730

                                                 Difference = z =      0.02890
                                                                         Sum =     .0312576

In this comparison the value of z is .02890. The standard
error of z (the square root of half the sum of the two
reciprocals) is .12502. The coefficient z deviates from zero by
an amount equal to about one fourth of the standard error
of z ($z/\sigma_z = .23$). This, of course, is a deviation that would
occur very frequently in a normally distributed variate
with mean value of zero. The result is, therefore, consistent
with the hypothesis that the true value of z is zero. There
is no significant difference between six per cent and seven
per cent preferred stocks in respect of the dispersion of
their quoted prices.
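Both comparisons of stock-price dispersion may be recomputed directly from the sums of squares. The sketch below is illustrative only; the function names are assumptions. Because it works from unrounded logarithms, the values of z agree with the tabulated figures to about three decimal places rather than five.

```python
import math

def sd_from_ss(sum_sq, df):
    """Standard deviation from a sum of squared deviations and its degrees of freedom."""
    return math.sqrt(sum_sq / df)

def fisher_z(sd1, sd2):
    """z: difference between the natural logarithms of two standard deviations."""
    return math.log(sd1) - math.log(sd2)

def se_z(n1, n2):
    """Standard error of z: square root of half the sum of the reciprocals."""
    return math.sqrt(0.5 * (1 / n1 + 1 / n2))

# Common stocks against seven per cent preferred stocks (65 degrees of freedom each)
z1 = fisher_z(sd_from_ss(99327.28, 65), sd_from_ss(30812.20, 65))
ratio1 = z1 / se_z(65, 65)

# Seven per cent against six per cent preferred stocks (65 and 63 degrees of freedom)
z2 = fisher_z(sd_from_ss(30812.2, 65), sd_from_ss(28175.0, 63))
ratio2 = z2 / se_z(65, 63)

print(round(z1, 3), round(ratio1, 2))   # a large multiple of the standard error
print(round(z2, 3), round(ratio2, 2))   # a small fraction of the standard error
```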

The Testing of Variability between Classes

The comparison of standard deviations provides a means 
of answering questions of another type. Measurements of 
changes in the average selling prices of products of manu- 
facturing industries may be used to exemplify the procedure. 
If we classify manufacturing industries into those producing 
perishable, semi-durable, and durable goods, and compute 
an average of changes occurring between 1929 and 1933 
in the selling prices of the products of each of these categories, 
we obtain the index numbers given in Table 113. 

                               Table 113

   Measurements of Average Changes in Selling Prices, 1929-1933, in
                Three Groups of Manufacturing Industries

                                    No. of       Index of selling prices
Class of industry                   industries      1929        1933

Producing perishable goods              34           100        69.81
Producing semi-durable goods            26           100        66.41
Producing durable goods                 25           100        78.96
All industries                          85           100        71.46

The average decline in prices was much less among durable
manufactured goods than among goods of the other classes;
semi-durable goods suffered the greatest loss. The range
of variation among the three averages is considerable, but
on the basis of the evidence here given we are not able to
say whether the observed differences are due to chance,
merely, or whether the prices of these several classes of
goods were subject to the play of quite different forces,
during the period here covered. An objective test is needed,
before we may assume that the observed differences are
significant.

For the application of such a test we need a measure of
variation which is independent of the principle of
classification here employed. How much might a series of price
relatives for 1933, on the 1929 base, be expected to vary
as a result of the play of chance? (By “chance” we here
mean the mass of causes unrelated to the factor of relative
durability.) A measure of the strength of such causes is
provided by the variation within the three classes we have
set up. The method used in measuring the variation within
these classes is indicated in Table 114 on page 496.

                               Table 114

        Illustrating the Measurement of Variation within Classes

          (1)                   (2)           (3)                  (4)
                                         Mean of price     Sum of squares of
   Class of industry          No. of       relatives     deviations of individual
                            industries   (1933 on 1929     price relatives from
                                             base)             class mean

Producing perishable goods      34           69.81            6,464.0275
Producing semi-durable goods    26           66.41            3,375.1849
Producing durable goods         25           78.96            5,725.6916
All industries                  85                           15,564.9040

It will be understood that the deviations which, in
squared form, enter into the sums in the last column are
the differences between individual items and the means
of the classes in which those items fall. Thus the relative
measuring the average selling price of products of the meat
packing industry in 1933 was 44.90 on the 1929 base.
This industry falls in the perishable goods group. The
difference between 44.90 and 69.81 is 24.91. The square
of this, or 620.5081, is one of the 34 items making up the
figure 6,464.0275, the entry for perishable goods in the
last column of the preceding table. The sum of the entries
in this last column, 15,564.9040, represents variation in
price changes within the three classes. It is not influenced
by factors of perishability or durability, since the total is
affected only by variation among perishable goods, variation
among semi-durable goods, and variation among durable
goods.

Eighty-five items enter into this total. However, only
82 degrees of freedom are present. The 34 perishable
goods possess 33 degrees of freedom to vary, the 26
semi-durable goods possess 25 degrees of freedom, and the 25
durable goods possess 24 degrees of freedom. For the
standard deviation defining variability within classes we
have, therefore,

$$\sqrt{\frac{15{,}564.9040}{82}} = 13.78.$$

This figure provides us with a yardstick, a measure of the
degree of variation that is independent of the principle
of classification employed in distinguishing perishable,
semi-durable, and durable goods. This measures the variation
due to the mass of floating causes known as chance.

With this standard we may compare the differences
between the three class averages presented in Table 113.
The magnitude of these differences may be defined by
a single measurement, a standard deviation. In its
computation the deviation of each class mean from the grand mean
is measured, and the square of this deviation is multiplied
by the number of items in the class in question. The
procedure is illustrated in Table 115.


                               Table 115

        Illustrating the Measurement of Variation between Classes

      (1)              (2)         (3)          (4)             (5)             (6)
                                 Mean of    Deviation of     Square of
   Class of           No. of      price      class mean     deviation of      Weighted
   industry         industries  relatives   from mean of   class mean from    squared
                                 (1929          all          mean of all      deviation
                                 = 100)     observations    observations      (2) × (5)
                                                                (4)²

Producing perishable
  goods                 34        69.81        - 1.65          2.7225           92.5650
Producing semi-
  durable goods         26        66.41        - 5.05         25.5025          663.0650
Producing durable
  goods                 25        78.96        + 7.50         56.2500        1,406.2500

                                                                             2,161.8800

The sum of the entries in column (6), 2,161.8800, measures 
the total variation between classes. Although weights are 
used in getting this total, the differences relate to three 
separate averages, only, and but two degrees of freedom 
are represented in the total. As a measure of the degree 
of variation between the three broad categories we have 
set up, we have 

$$\sqrt{\frac{2{,}161.8800}{2}} = \sqrt{1{,}080.9400} = 32.88.$$

This we may take as a measure of the difference in degree 
of change in selling price from 1929 to 1933 that appears 
to be related to the relative durability of products. 
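
The between-class figure can be reproduced from the three class means. This is a minimal sketch with hypothetical names; it assumes the grand mean is rounded to two places (71.46), which is what the tabulated deviations in Table 115 imply.

```python
import math

# (class, number of industries, mean price relative) from Table 115.
classes = [
    ("perishable", 34, 69.81),
    ("semi-durable", 26, 66.41),
    ("durable", 25, 78.96),
]

total_n = sum(n for _, n, _ in classes)

# Weighted grand mean, rounded to two places as the table's deviations imply.
grand_mean = round(sum(n * m for _, n, m in classes) / total_n, 2)

# Squared deviation of each class mean, weighted by the class size.
between_ss = sum(n * (m - grand_mean) ** 2 for _, n, m in classes)

# Three class means leave two degrees of freedom.
between_df = len(classes) - 1
between_sd = math.sqrt(between_ss / between_df)

print(grand_mean, round(between_ss, 4), round(between_sd, 2))
```

The weighted sum agrees with the total of column (6), and dividing by the two degrees of freedom before taking the root gives the standard deviation between classes.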

The next step involves the formal testing of this figure 
against the standard provided by the measures of variation 
within classes. This test is applied in Table 116. Certain 
necessary calculations are also indicated. 


Table 116 


Comparison of Measures of Variation 


Nature of          Degrees of   Sum of          Mean square    Standard     log₁₀ σ     logₑ σ
variability        freedom      squared         deviation      deviation
                   n            deviations *    σ²             σ

Between classes     2            2,161.8800     1,080.9400     32.88        1.51693     3.49288
Within classes     82           15,564.9040       189.8159     13.78        1.13925     2.62324
Total                           17,726.7840
                                                          Difference = z =              0.86964


The test reduces, it is clear, to a comparison of two 
measures of variability. One, the standard of comparison, 
is the measure of variance within classes, a measure com- 
pletely independent of the perishability or durability of 
the product. The other is a measure of variation between 
classes. Such variation might be due to the same general 
mass of causes responsible for variation within classes, or 
it might be due to special forces related to differences in 
the durability of the goods in question. If the former 
explanation is correct, the two measures of variation should 
be of the same order of magnitude, with due allowance 

* The figure 17,726.7840, which is the sum of the squared deviations of 
individual observations from their respective class means and of the squared 
deviations of the several class means from the mean of all the observations, is 
equal to the sum of the squared deviations of all the individual observations 
from the mean of all the observations. In the table the total has been broken 
into two components, representing variability between classes and variability 
within classes. 
for sampling fluctuations. If the second is the correct ex- 
planation the two measures of variation may differ appreci- 
ably in magnitude. The test, therefore, reduces to the 
question: Is the variation between classes significantly dif- 
ferent from the variation within classes, account being 
taken of the degrees of freedom present in the two cases? 

This question could be answered with reference to the 
standard error of z, provided the distribution of z be normal, 
or approximately normal. This is the case when the n’s 
that measure the number of degrees of freedom are both 
large or when, though of moderate value, they are equal 
or nearly equal. This condition prevailed in the examples 
cited earlier. It is not met in the present instance, so we 
may not with accuracy employ the method of estimating 
and utilizing the standard error of z that was used in the 
earlier case. When the numbers of degrees of freedom are 
unequal and relatively small, as in this case, tests of signifi- 
cance may be most readily made with reference to a tabula- 
tion of values of z, prepared by R. A. Fisher. This tabulation 
gives, for various values of n₁ and n₂, values of z that would 
be exceeded 5 times out of 100, as a result of chance, if 
the true value of z were zero; it also gives one per cent 
values of z, i.e., values of z that would be exceeded 1 time 
out of 100, under conditions of random sampling, if the 
true value of z were zero. These two sets of values are 
reproduced in Appendix Tables VI and VII of this book, 
through the courtesy of Dr. Fisher and Oliver and Boyd, of 
Edinburgh, his publishers.¹ 

In the present example the value of z, defining the degree 
of difference between the two measures of variation in 

¹ Uses of the function z are discussed in R. A. Fisher's book, Statistical Meth- 
ods for Research Workers, Edinburgh, Oliver and Boyd, sixth ed., 1936. 

A table similar to Fisher's z-table, but relating to an alternative measure F, 
has been constructed by George W. Snedecor. F is derived directly from the 
variances (i.e., the values of σ²) that are being compared; it is the ratio of the 
larger of the two variances to the smaller. For a table of values of F and a 
discussion of its uses see George W. Snedecor, Statistical Methods, Ames, 
Iowa, Collegiate Press, 1937. 
Table 116, is .8696. In entering the z-table for the purpose 
of testing this measurement, the number of degrees of 
freedom corresponding to the larger of the two measures 
of variation compared is taken as n₁; n₂ is the number 
of degrees of freedom corresponding to the smaller of the 
two measures. This is a necessary procedure, with reference 
to the table as constructed. In the problem that now con- 
cerns us n₁ = 2, n₂ = 82. 

For n₁ = 2 and n₂ = 60, the 1 per cent value of z is 
.8025; for n₁ = 2 and n₂ = ∞, the 1 per cent value of z 
is .7636. Interpolating, we obtain .7920 as the 1 per cent 
value of z for n₁ = 2, n₂ = 82.¹ If the true value of z were 
zero, we should expect a value as great as .7920, or greater, 
to occur as a result of chance only 1 time out of 100. The 
present value of z materially exceeds .7920; the probability 
of a value as great as this occurring as a result of chance, 
if the true value of z were zero, is less than 1 out of 100. 
The results of the test are not, therefore, consistent with 
the hypothesis that the true value of z is zero. The differ- 
ences between the three class averages shown in Table 113 
are too great to be attributed to chance. We may conclude 
that the price movements of perishable, semi-durable, and 
durable manufactured goods between 1929 and 1933 were 
significantly different. 
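
The comparison just made can be sketched numerically. The names below are hypothetical; the two standard deviations come from Table 116 and the 1 per cent value of z is the interpolated figure quoted in the text. The last step uses the fact that Snedecor's F, being a ratio of variances, equals the exponential of 2z.

```python
import math

sd_between = 32.88  # standard deviation between classes (Table 116)
sd_within = 13.78   # standard deviation within classes (Table 116)

# Fisher's z: difference between the natural logarithms of the two
# standard deviations.
z = math.log(sd_between) - math.log(sd_within)

# Snedecor's F is the ratio of the larger variance to the smaller.
f_ratio = (sd_between / sd_within) ** 2

# Interpolated 1 per cent value of z for n1 = 2, n2 = 82, from the text.
one_per_cent_z = 0.7920
significant = z > one_per_cent_z

print(round(z, 4), round(f_ratio, 2), significant)
```

Because F and z are tied by a fixed transformation, a test made in an F-table and one made in a z-table lead to the same verdict.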

¹ Interpolation in the z-table is based upon direct proportions among the 
reciprocals of the n’s. In the above case 

for n₁ = 2, n₂ = 60: the 1 per cent value of z = .8025;   1/n₂ = 1/60 = .0167 
for n₁ = 2, n₂ = ∞:  the 1 per cent value of z = .7636;   1/n₂ = 1/∞  = .0000 
                                               Δ = .0389              Δ = .0167 

We must find the 1 per cent value of z corresponding to n₁ = 2, n₂ = 82. 
For 1/n₂ we have 

1/82 = .0122. 

The difference between 1/82 and 1/∞, for which we must interpolate between 
the given values of z, is .0122 - .0000 = .0122. The required 1 per cent value 
of z is 

$$.7636 + \left(\frac{.0122}{.0167} \times .0389\right) = .7920.$$

The process of interpolation on the n₁ scale, if required, would be similar. 
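
The interpolation described in this note can be expressed as a small function. This is a sketch under stated assumptions: the function name and its arguments are hypothetical, and it works with unrounded reciprocals, so the fourth decimal place may differ slightly from the hand computation above.

```python
def interpolate_on_reciprocal(n, n_low, z_low, n_high, z_high):
    """Interpolate a tabled z value linearly in 1/n between two tabled entries.

    n_high may be float('inf'), whose reciprocal is treated as zero.
    """
    def recip(value):
        return 0.0 if value == float("inf") else 1.0 / value

    # Share of the tabled interval, measured on the reciprocal scale.
    fraction = (recip(n) - recip(n_high)) / (recip(n_low) - recip(n_high))
    return z_high + fraction * (z_low - z_high)


# 1 per cent values for n1 = 2 at n2 = 60 and n2 = infinity, from the note.
z_82 = interpolate_on_reciprocal(82, 60, 0.8025, float("inf"), 0.7636)
print(round(z_82, 3))
```

The same function serves on the n₁ scale, by passing values of n₁ and the corresponding tabled entries instead.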
Variance Analysis in the Measurement of 
Relationship 

The procedure employed in the comparison of measures 
of variability is applicable to the measurement of correlation. 
Indeed, using this technique it is possible to employ a 
systematic procedure that is of great value in revealing 
the character and degree of the relationships prevailing 
between variable quantities. This procedure is illustrated 
in the next section. 

The method employed in applying to a typical correlation 
problem the method of analysis based on comparison of 
variances may be illustrated with reference to the data of 
alfalfa yield previously studied. These are presented in 
Table 117. 


Table 117 

Summary of Results Secured in Experiments with Alfalfa 

(The measurements in the body of the table measure yields, in tons per acre, 
in 44 experiments) 

                    Inches of irrigation water applied 

              0      12      18      24      30      36      48      60 

            2.35    4.31    5.69    6.00    7.53    7.58    8.05    5.55 
            2.75    4.78    6.46    6.89    7.97    8.22    8.45    7.25 
            2.89    4.84    7.02    7.96    8.32    8.63    8.63   10.17 
            3.85    5.83    8.02    8.32    9.43    9.33    8.83   10.70 
            5.52    6.51            8.38    9.54    9.38    9.52 
            5.94    [?]             [?]    11.06   12.48   10.62 

Average 
yield       3.88    5.63    6.80    7.92    8.98    9.27    9.02    8.42    7.48 

The average yield of alfalfa, in these 44 experiments, 
was 7.48 tons per acre. But there was rather wide variation 
among the results. The sum of the squares of the deviations 
of the 44 observations from the mean is 228.33. This 
sum sets our problem. We should like to find reasons 
for this variation. 
TESTING FOR THE EXISTENCE OF CORRELATION 

The observations are set up above in a form suited to 
the testing of one hypothesis concerning the factors affect- 
ing alfalfa yield. The data are arranged in eight arrays, 
classified according to the depth of irrigation water applied. 
This depth varied from 0 to 60 inches. Variations in yield 
appear to be associated with variations in amount of water 
applied. As a basis for our procedure we set up the hypothe- 
sis that there is no such association. To test this hypothesis, 
we may break the sum that measures the total variation 
of yields into two parts measuring, respectively, the variation 
within arrays and the variation between arrays. 

To determine the total variation within arrays, the devia- 
tion of each observation from the mean of the array in 
which it falls is measured. The sum of the squares of these 
deviations, for all the arrays, is the desired total. Thus, 
in the first array of Table 117, the mean is 3.88 tons. 
The deviation of the first observation, 2.35, from this 
figure, is - 1.53; its square is 2.3409. The deviation of 
the second observation, 2.75, is - 1.13; its square is 
1.2769. Determining in similar fashion the deviations of 
the four other observations in that array from the mean 
of the array, squaring these, and adding the six squared 
values, we have 11.5320 as the sum of the squares of the 
deviations in the first array. Performing similar calculations 
for the seven other arrays, and adding the eight sums thus 
secured, we have a figure of 76.39. This is the total varia- 
tion within arrays. For convenience we may refer to this 
as component A of the total variation. 
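
The calculation for the first array may be sketched as follows. The list name is hypothetical, and the six yields are read from the first column of Table 117, an assumption supported by the extreme values 2.35 and 5.94 cited in the text. The array mean is rounded to two places, as in the text.

```python
# Yields in the first array (no irrigation water), tons per acre.
first_array = [2.35, 2.75, 2.89, 3.85, 5.52, 5.94]

# The text works from the array mean rounded to two places (3.88).
array_mean = round(sum(first_array) / len(first_array), 2)

# Squared deviation of each observation from the mean of its own array.
squared_deviations = [(y - array_mean) ** 2 for y in first_array]
within_first_array = sum(squared_deviations)

print(array_mean, round(within_first_array, 4))
```

Repeating the same steps for each of the other seven arrays and adding the eight sums gives component A.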

In determining the total variation between arrays, the 
deviations of the means of the various arrays from the 
mean of all the observations are measured and squared, 
and the weighted sum of these squares is secured. Weights 
are based upon the number of observations in the several 
arrays. Thus the mean of the first array, 3.88, deviates 
from the mean of all the observations, 7.48, by 3.60; 
the square of this is 12.9600. Multiplying by six (the 
number of observations in the first class), we have 77.7600. 
Securing similar weighted figures for the seven other arrays, 
and adding, we have 151.94 as the total variation between 
arrays. This we may call component B. 
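
Component B can be approximated from the array means printed in Table 117. The sketch below uses hypothetical names and the means as rounded to two places, so it arrives near 151.8 rather than exactly at the text's 151.94, which rests on figures carried to more decimal places.

```python
# (inches of water, number of observations, mean yield) from Table 117.
arrays = [
    (0, 6, 3.88), (12, 6, 5.63), (18, 4, 6.80), (24, 6, 7.92),
    (30, 6, 8.98), (36, 6, 9.27), (48, 6, 9.02), (60, 4, 8.42),
]
grand_mean = 7.48  # mean of all 44 observations

# Weighted squared deviation of the first array mean, as in the text.
first_term = arrays[0][1] * (arrays[0][2] - grand_mean) ** 2

# Sum of the weighted squared deviations over all eight arrays.
between_arrays = sum(f * (mean - grand_mean) ** 2 for _, f, mean in arrays)

print(round(first_term, 4), round(between_arrays, 2))
```

The small gap between this result and 151.94 illustrates why independently computed components must be carried to extra decimal places if they are to add exactly to the total.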

In breaking up the total variation into two components¹ 
we have distinguished variations in yield that are definitely 
not related to differences in depth of irrigation water applied, 
from variations in yield that may or may not be related 
to irrigation differences. Within the first array, including 
six experiments on plots to which no irrigation water was 
applied, yields varied from 2.35 tons to 5.94 tons per acre. 
The total variation within this array (the sum of the squares 
of the deviations from the mean of the array) amounted 
to 11.5320. Since the irrigation factor was constant, this 
sum measures variation which is completely independent 
of changes in irrigation. This is true also of the figure 
76.39, measuring total variation within all the eight arrays 
set up in Table 117. Differences in soils and innumerable 
minor factors combined to create variation within these 
arrays. The figure 76.39 measures the play of that host 
of undefined forces to which we give the name chance. 
The one specific factor which does not affect this figure is 
irrigation. We have measured the variation in such a 
way that irrigational differences do not enter. 

Irrigational differences do enter definitely into the varia- 
tion between arrays. Indeed, it may be the dominant 
factor in this variation, which is measured by the figure 

¹ The sum of the two components is, of course, equal to the total. 

Variation within arrays (Component A)      76.39 
Variation between arrays (Component B)    151.94 
Total variation                           228.33 

For a demonstration of this relationship see note, pp. 418-9. 

To ensure full consistency between components A and B and the total 
(and among the sub-divisions of B later defined), when these quantities are 
independently computed, it is necessary that all computations be carried to 
more decimal places than are customarily retained. 
151.94. But of this we cannot be sure. For the means of 
the eight arrays differ among themselves not only because 
of differences in the amounts of irrigation water applied to 
the different plots. To yield differences due to the irrigation 
factor are added yield differences due to the innumerable 
other forces that influence alfalfa yield, the forces we lump 
together as chance. For chance factors affect the means 
of the various arrays, and so affect the variation between 
arrays, just as they affect the variation within arrays. As 
the experiment was designed, the influence of irrigational 
differences is present only in the variation between arrays, 
but the influence of “chance” is present in both the variation 
within arrays and the variation between arrays. 

In this fact is found the key to our problem, and the 
instrument for testing our hypothesis. For, in so far as 
chance alone is operative, the variation between arrays 
would be expected to be of the same order of magnitude 
as the variation within arrays. The figures we have so 
far examined indicate that the variation between arrays 
is greater than the variation within arrays. But this may 
be a purely fortuitous result. The apparent increase of 
yield with increased irrigation may be entirely a chance 
phenomenon, similar to a run of heads in tossing a coin. 
This we must test. We must determine whether the forces 
responsible for variation between arrays are the same as 
the forces responsible for variation within arrays. 

The hypothesis we shall test, and which may of course be 
disproved, is that the forces responsible for variation between 
arrays are the same as the forces responsible for variation 
within arrays; in other words, that there is no association 
between depth of irrigation water applied and alfalfa yield. 
The nature of the test to be applied has been indicated 
in the preceding sections. We shall compare two measures 
of variation, to determine whether they are of the same 
order of magnitude. But before this test is applied, account 
must be taken of the number of degrees of freedom pre- 
vailing in each case. This concept calls for brief explana- 
tion. 

If the data of alfalfa yield related to but one plot of 
land, in one year, there would be no variation. A single 
observation would coincide with the mean, and the standard 
deviation would be zero. With a second observation oppor- 
tunity for variation arises. But we may think of it as a 
single opportunity. With but two observations there is 
but one degree of freedom to vary. With three observations, 
two opportunities to vary are given; there are two degrees 
of freedom. In problems of this sort the number of degrees 
of freedom is equal to N - 1. Our present example includes 
44 observations; hence the total variation 228.33 represents 
the resultant of 43 degrees of freedom. 

How are these 43 degrees divided between the two 
components, A and B? As regards variation within arrays, 
this may be readily determined by reference to Table 117. 
Variation within arrays, it will be recalled, was measured 
with reference to the means of the various arrays. In the 
first array, containing six observations, there exist five 
degrees of freedom to vary from the mean of that array. 
The same is true of the arrays relating to 12, 24, 30, 36, and 
48 inches of irrigation water. In each of the arrays relating 
to 18 and 60 inches of water there are but four observations, 
with three degrees of freedom. The total of these degrees 
of freedom is 36. Variation between arrays was determined 
by measuring the deviations of the means of eight arrays 
from the general mean of the distribution. Since eight 
different values are involved, there are seven degrees of 
freedom. (The fact that weights were employed in securing 
the total variation between arrays does not affect the deter- 
mination of degrees of freedom.) The 36 and the 7, combined, 
use up all the 43 degrees of freedom entering into the total 
variation. 
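
The division of the degrees of freedom can be verified by simple counting. The names in this sketch are hypothetical; the array sizes are those stated in the paragraph above.

```python
# Observations in the arrays for 0, 12, 18, 24, 30, 36, 48 and 60 inches.
array_sizes = [6, 6, 4, 6, 6, 6, 6, 4]

# Total: N - 1 for all observations taken together.
total_df = sum(array_sizes) - 1

# Within arrays: each array loses one degree of freedom to its own mean.
within_df = sum(n - 1 for n in array_sizes)

# Between arrays: the number of array means, less one.
between_df = len(array_sizes) - 1

print(total_df, within_df, between_df)
```

The within and between counts add to the total, as the text requires.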

Knowing these degrees of freedom we may now reduce 
the measures of variation within arrays and of variation 
between arrays to comparable terms, and determine the 
significance of the difference between them. This is done 
in Table 118. This table and others following differ somewhat 
from those employed in similar comparisons in the opening 
sections of this chapter. In the earlier tables variability 
was measured in units of the standard deviation, and the 
function z was derived from the relationship 

$$z = \log_e \sigma_1 - \log_e \sigma_2.$$

It is often more convenient to perform the necessary calcula- 
tions in terms of the variance, that is, of σ², and to derive 
z from the relationship 

$$z = \frac{\log_e \sigma_1^2 - \log_e \sigma_2^2}{2}.$$

The procedures lead to the same result, of course, since 
half the difference between the logarithms of the squared 
standard deviations is equal to the difference between the 
logarithms of the standard deviations, but the use of squared 
measurements eliminates one step in the calculation. 
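
The equivalence of the two forms follows in one line from the rule that the logarithm of a square is twice the logarithm of the quantity squared. As a worked step, with the symbols used for the two standard deviations being the only notation assumed:

$$\frac{\log_e \sigma_1^2 - \log_e \sigma_2^2}{2} = \frac{2\log_e \sigma_1 - 2\log_e \sigma_2}{2} = \log_e \sigma_1 - \log_e \sigma_2 = z.$$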


Table 118 

A Test of the Existence of Correlation 

Nature of            No. of        Sum of     Mean square   Natural logarithm 
variability          degrees of    squares    (variance)    of mean square 
                     freedom 
                     (n)                      σ²            logₑ σ² 

Within arrays 
  (Component A)      36             76.39      2.12         0.7514 
Between arrays 
  (Component B)       7            151.94     21.71         3.0778 

                                              Difference =  2.3264 
                                                       z =  1.1632 

When we divide the sums of the squares by the corre- 
sponding figures defining degrees of freedom, we have com- 
parable measures of variance. Now it appears that the 
variance between arrays (21.71) is distinctly greater than 
the variance within arrays (2.12), in disproof of the hypothe- 
sis that the same forces account for the two variances. 
But we have a precise test to employ in determining 
whether these two variances are of the same degree of 
magnitude, within sampling limits. This is the coefficient 
z, which is half the difference between the natural logarithms 
of the two variances. In the present case, z is equal to 

$$\frac{3.0778 - 0.7514}{2},$$

or 1.1632. 


If the forces responsible for variation within arrays were 
the same as those responsible for variation between arrays 
(that is, if our hypothesis were true), the value of z would 
be zero, with a sample of infinite size. The value of z we 
have secured is not zero. This may be proof that our 
hypothesis is false, or it may merely be a result of sampling 
fluctuations. The value of z might be zero in a given infinite 
population, but a random sample would be expected to 
yield results deviating considerably from zero. We wish 
now to take account of sampling fluctuations, in determining 
whether the result we have secured is consistent with the 
hypothesis that the true value of z is zero. 

In determining the significance of the present results 
we enter Appendix Table VI with n₁ (the number of degrees 
of freedom corresponding to the larger variance) equal to 
7 and n₂ equal to 36. Interpolating in Table VI, we find that 
the 1 per cent value of z corresponding to the stated values 
of n₁ and n₂ is .5780.¹ A value as great as this or greater 


¹ It is necessary to interpolate on both scales of the z-table. First, following 
the procedure indicated on a preceding page, we interpolate in respect of n₂. 
We obtain 

for n₁ = 6, n₂ = 36, the 1 per cent value of z = .6047;   1/6 = .1667 
for n₁ = 8, n₂ = 36, the 1 per cent value of z = .5580;   1/8 = .1250 
                                              Δ = .0467     Δ = .0417 

We must now interpolate on the n₁ scale, since the degrees of freedom are 
n₁ = 7, n₂ = 36. For 1/n₁ we have 1/7 = .1429. The difference between 1/7 
and 1/8, for which we must interpolate between the given values of z, is 
.1429 - .1250, or .0179. The required 1 per cent value of z is 

$$.5580 + \left(\frac{.0179}{.0417} \times .0467\right) = .5780.$$
would occur only 1 time out of 100, as a result of sam- 
pling fluctuations, if the true value of z were zero. The 
actual value, 1.1632, far exceeds the 1 per cent value 
of z. The evidence strongly indicates that z deviates from 
zero not because of the play of chance, but because the 
forces responsible for variation between arrays are of a 
different order from those responsible for variations within 
arrays. We are justified in concluding that our results 
are not consistent with the assumption that the true value 
of z is zero. The hypothesis that the forces responsible 
for variation between arrays are of the same character 
as those responsible for variation within arrays is not 
tenable. The results indicate the presence of a real con- 
nection between alfalfa yield and depth of irrigation water 
applied. 
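
The whole test for the existence of correlation may be gathered into a short sketch. The names are hypothetical; the two mean squares come from Table 118, and the tabled 1 per cent values for n₁ = 6 and n₁ = 8 (at n₂ = 36) are those quoted in the footnote on interpolation.

```python
import math

variance_between = 21.71  # mean square between arrays (Table 118)
variance_within = 2.12    # mean square within arrays (Table 118)

# z is half the difference between the natural logarithms of the variances.
z = 0.5 * (math.log(variance_between) - math.log(variance_within))

# 1 per cent values of z at n2 = 36, for n1 = 6 and n1 = 8.
z_n1_6, z_n1_8 = 0.6047, 0.5580

# Interpolate for n1 = 7 in proportion to the reciprocals of n1.
fraction = (1 / 7 - 1 / 8) / (1 / 6 - 1 / 8)
critical_z = z_n1_8 + fraction * (z_n1_6 - z_n1_8)

hypothesis_rejected = z > critical_z
print(round(z, 4), round(critical_z, 4), hypothesis_rejected)
```

The observed z is roughly twice the 1 per cent value, which is why the hypothesis of no association is set aside with such confidence.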

TESTING THE HYPOTHESIS OF A LINEAR RELATIONSHIP 

Since it appears that there is a relationship between 
these two variables, it is now in order to secure an acceptable 
function, defining the relationship in quantitative terms. 
We may do this by testing, in turn, various hypotheses 
concerning the form of this function, until we secure one 
with which the observations are not inconsistent. We shall 
start with the hypothesis that there is a linear relationship 
between alfalfa yield and depth of irrigation water applied.¹ 

The first step in applying the present test is to fit a 
straight line to the means of the eight arrays shown in 
Table 117. Variation among these means (component B 
of the total variation) reflects the presence of correlation 

¹ Each hypothesis tested should be rational, acceptable on logical grounds. 
If we are thinking of general relationships, prevailing over the entire range of 
possible observation, the assumption of a straight-line relationship between 
alfalfa yield and amount of irrigation water applied is not tenable. For it is 
not to be expected that increased irrigation will increase yield without limit. 
In the present case we test the hypothesis of a linear relationship in order that 
the demonstration of procedure may be systematic and complete, although that 
hypothesis is not a rational one, even within the range of the present observa- 
tions. 
between alfalfa yield and irrigation water applied. If the 
correlation is perfectly linear, all these class means will 
fall on the straight line; all the variation between arrays 
will be accounted for by the hypothesis of a linear relation- 
ship. If the relationship is substantially, though not per- 
fectly, linear, the portion of component B not accounted 
for by linear regression will be insignificant. If the regression 
is not truly linear the residue of B not accounted for (i.e., 
the scatter of the means of the arrays about the straight 
line of regression) will be too great, and some other hypothe- 
sis concerning the character of the relationship between 
alfalfa yield and irrigation water applied must be employed. 

A straight line fitted by the method of least squares 
to the means of the eight arrays is shown in Fig. 82 on 
page 406. The equation to the line is Y = 5.038 + .0886X, 
where Y is alfalfa yield in tons per acre and X is depth of 
irrigation water applied, in inches. [We should note that 
in the fitting process the mean of each array is weighted 
by the number of observations in that array. This means, 
merely, that six points are assumed to have coordinates 
of 0, 3.88 (equal to those of the mean of the first array), 
that four points are assumed to have coordinates of 18, 6.80 
(equal to those of the mean of the third array), etc.] In 
Table 119 on page 510 are given the values of the means of 
the various arrays, and the corresponding computed values, 
as derived from the straight line of regression. 
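
The weighted fitting process can be sketched with the usual least-squares formulas, each array mean counting as many times as there are observations in the array. The names are hypothetical, and the means are the two-place values of Table 117, so the constants agree with the text's equation only to the places shown.

```python
# (inches of water X, number of observations f, mean yield Y).
arrays = [
    (0, 6, 3.88), (12, 6, 5.63), (18, 4, 6.80), (24, 6, 7.92),
    (30, 6, 8.98), (36, 6, 9.27), (48, 6, 9.02), (60, 4, 8.42),
]

n = sum(f for _, f, _ in arrays)
mean_x = sum(f * x for x, f, _ in arrays) / n
mean_y = sum(f * y for _, f, y in arrays) / n

# Weighted sums of products and squares about the means.
s_xy = sum(f * (x - mean_x) * (y - mean_y) for x, f, y in arrays)
s_xx = sum(f * (x - mean_x) ** 2 for x, f, _ in arrays)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(round(intercept, 3), round(slope, 4))
```

Weighting by the number of observations is equivalent to fitting the line to 44 points, with every observation placed at the mean of its array.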

It is clear from the graph and the table that the fit of 
the straight line to the means of the arrays is not perfect. 
The inadequacy of the fit is measured by the sum of the 
squared deviations of the class means from corresponding 
computed values (each squared deviation being weighted 
by the number of observations in the given class). This 
sum is equal to 44.79. 

This sum, to which we may refer as B₂, is one component 
of B, the variation between arrays. It is that portion of 
the variation between arrays that is not accounted for 



Table 119 

Alfalfa Yield and Depth of Irrigation Water 
(Class means and values based on linear relationship 
Y = 5.038 + .0886X) 

(1)        (2)        (3)        (4)             (5)                (6)        (7) 
Inches     No. of     Mean       Estimated       Difference 
of         obser-     yield      yield, linear   between mean 
water      vations    of         relationship    yield of class 
(class)               class      (tons)          and estimated 
                      (tons)                     yield 
                                                 (Yp - Yc) 
           f          Yp         Yc              d                  d²         fd² 

 0         6          3.88        5.04           - 1.16             1.3456      8.0736 
12         6          5.63        6.10           -  .47              .2209      1.3254 
18         4          6.80        6.63           +  .17              .0289       .1156 
24         6          7.92        7.16           +  .76              .5776      3.4656 
30         6          8.98        7.70           + 1.28             1.6384      9.8304 
36         6          9.27        8.23           + 1.04             1.0816      6.4896 
48         6          9.02        9.29           -  .27              .0729       .4374 
60         4          8.42       10.36           - 1.94             3.7636     15.0544 

                                                                               44.7920 


by the hypothesis of a linear relation between yield and 
irrigation water. The method of deriving the other compo- 
nent of B is shown in Table 120. 

The sum 107.15, to which we may refer as B₁, is that 
component of the variation between arrays which is 
accounted for by the hypothesis of linear regression. The 
items in col. (3) of Table 120 differ from 7.48, the 
mean of all the observations, for the reason suggested 
by the hypothesis. They differ, on our present assump- 
tion, because with increased applications of water yield 
increases in a manner defined precisely by the equation 
Y = 5.038 + .0886X. The sum of these variations, 107.15, 
represents, on this assumption, the full effect on alfalfa 
yield of variations of irrigation applications. 
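
The two portions of the variation between arrays can be computed side by side. This is a sketch with hypothetical names; the class means and the values computed from the line are the two-place figures tabulated in the text, and B₁ and B₂ follow the text's labels for the explained and unexplained portions.

```python
grand_mean = 7.48  # mean of all 44 observations

# (observations f, class mean yield, yield computed from the fitted line)
rows = [
    (6, 3.88, 5.04), (6, 5.63, 6.10), (4, 6.80, 6.63), (6, 7.92, 7.16),
    (6, 8.98, 7.70), (6, 9.27, 8.23), (6, 9.02, 9.29), (4, 8.42, 10.36),
]

# B2: weighted squared deviations of the class means from the line.
b2 = sum(f * (mean - fitted) ** 2 for f, mean, fitted in rows)

# B1: weighted squared deviations of the line from the grand mean.
b1 = sum(f * (fitted - grand_mean) ** 2 for f, mean, fitted in rows)

print(round(b2, 4), round(b1, 4), round(b1 + b2, 2))
```

With these rounded figures the two portions add to 151.94, the total variation between arrays.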

The total of the two sums to which we have referred as 
B₁ and B₂ is equal to 151.94, the total variation between 
arrays. Working on the hypothesis that the variables 
Table 120 

Computation of Variation in Alfalfa Yield Attributable to Irrigation 
Differences on the Hypothesis of Linear Regression 

(1)        (2)        (3)             (4)          (5)                (6)        (7) 
Inches     No. of     Estimated       Mean yield,  Difference 
of         obser-     yield, linear   all obser-   between mean 
water      vations    relationship    vations      yield and 
                      (tons)                       yield esti- 
                                                   mated on lin- 
                                                   ear hypothesis 
           f          Yc              Y            d                  d²         fd² 

 0         6           5.04           7.48         - 2.44             5.9536     35.7216 
12         6           6.10           7.48         - 1.38             1.9044     11.4264 
18         4           6.63           7.48         -  .85              .7225      2.8900 
24         6           7.16           7.48         -  .32              .1024       .6144 
30         6           7.70           7.48         +  .22              .0484       .2904 
36         6           8.23           7.48         +  .75              .5625      3.3750 
48         6           9.29           7.48         + 1.81             3.2761     19.6566 
60         4          10.36           7.48         + 2.88             8.2944     33.1776 

                                                                                107.1520 


with which we are dealing stand in a linear relationship, 
we have broken the component B of the total variation 
into two portions. One of these (B₁) measures the variation 
between arrays that is accounted for by the linear hypothesis; 
the other (B₂) measures the variation between arrays that 
is not accounted for by that hypothesis. We should expect 
some departure from linearity in a sample such as ours, 
even though it were drawn from a universe marked by a 
perfect linear relationship. But there are limits to the 
deviations that might reflect fluctuations of sampling. The 
question we now face is whether B₂ is small enough to be 
accepted as the resultant of random factors, or whether 
it is so large as to represent a breakdown of our hypothesis. 

In our earlier discussion we noted that component A 
of the total variation measured the influence of a host of 
random forces affecting alfalfa yield, forces other than the 
irrigation factor. Component A, therefore, serves as an 
index of the magnitude of random forces, and hence as a 
standard defining the probable limits of sampling fluctua- 
tions, in so far as these are present in component B. We 
may use component A, which relates to variation within 
arrays, as a yardstick in determining whether B₂ is attribut- 
able to fluctuations of sampling, or whether it is too large 
to be so explained. 

In comparing components A and B₂ account must be 
taken of the number of degrees of freedom present in each. 
This has already been established for A. The following 
tabular summary of the operations just performed may help 
to explain the relations involved for B₂. 



Nature of variability                          No. of degrees   Sum of     Mean 
                                               of freedom       squares    square 

Between arrays, due to linear regression 
  (Component B₁)                                     1          107.15 
Deviations from straight line of regression 
  (Component B₂)                                     6           44.79      7.47 
Total variation between arrays 
  (Component B)                                      7          151.94 



The seven degrees of freedom entering into component B 
are divided, one to component B₁ and six to component B₂. 
That the points on a straight line vary from one another 
with one degree of freedom is clear from a consideration 
of a linear equation y = a + bx. That the values of y 
may differ is due to the presence of the coefficient b, which 
defines the slope. If b were zero, the equation would define 
a horizontal line, with values of y constant. It is the slope 
that constitutes the one degree of freedom among points 
defined by a linear equation. With respect to B₂, we are 
dealing with eight points, to which a straight line has 
been fitted. If there were but two points both of them 
would lie on the line; there would be no possibility of 
deviation. With three points, one degree of freedom to 
deviate is introduced; with eight points there are six degrees 
of freedom. The degrees of freedom to deviate from any 
fitted curve are obviously equal to the number of points 
to which the curve is fitted, less the number of constants 
in the equation to that curve. 
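
The counting rule just stated is simple enough to put into a few lines. The function name below is hypothetical, and the three calls merely repeat the cases of two, three, and eight points discussed in the text for a straight line, which has two constants.

```python
def degrees_of_freedom_about_curve(points, constants):
    """Degrees of freedom left for deviations from a fitted curve."""
    if constants > points:
        raise ValueError("a curve cannot have more constants than fitted points")
    return points - constants


# A straight line y = a + bx has two constants, a and b.
two_points = degrees_of_freedom_about_curve(2, 2)
three_points = degrees_of_freedom_about_curve(3, 2)
eight_points = degrees_of_freedom_about_curve(8, 2)

print(two_points, three_points, eight_points)
```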

Dividing 44.79 by 6 we may secure, then, the value of 
the variance (the mean square) comparable to the variance 
of component A. A test of our hypothesis again reduces to 
a comparison of variances. This appears in Table 121. 

Table 121 

A Test of the Hypothesis of Linear Relationship 

Nature of variability              Degrees of   Mean square   Natural logarithm 
                                   freedom      (variance)    of mean square 
                                   n            σ²            logₑ σ² 

Within arrays (Component A)        36           2.12           .7514 
Deviation from straight line 
  of regression (Component B₂)      6           7.47          2.0109 

                                                Difference =  1.2595 
                                                         z =   .6298 


The variation within arrays reflects the play of random 
factors, independent of irrigation. The force of these factors 
is indicated by a variance of 2.12. If similar random factors, 
independent of irrigation, were responsible for the deviations 
of the means of the eight arrays from the straight line of 
regression, we should expect the variance that measures 
such deviations to be of the same order of magnitude. 
Actually it is much greater, 7.47. But we cannot say, 
from inspection, that the difference between the two vari- 
ances is not due to fluctuations of sampling. An accurate 
test is needed. We may compute the coefficient z, half 
the difference between the natural logarithms of the two 
variances, and apply such a test. 

From the values given we secure a value of z equal to
.6298. In determining whether this value is significantly
different from zero, use must be made again of Fisher’s
tables. For the values of n1 and n2 are relatively small
and unequal, and the distribution of z under these conditions
would not be sufficiently close to the normal type to justify
the use of its standard deviation. Entering Appendix
Table VI with n1 equal to 6, n2 to 36, we find that the 1 per
cent value of z is .6047. We take this to mean that, if the true
value of z were zero, random sampling fluctuations would
be expected to give a value of z as great as .6047, or greater,
only one time out of 100 trials. The actual value of z in
the present instance is greater than .6047. Only rarely,
less frequently than one time out of 100, would chance
account for a value of z as great as the one observed. We
conclude, therefore, that random forces, of the type responsible
for variation within arrays, are not responsible for the
deviations of the means of the eight arrays from the straight
line of regression. These deviations are too great to be
consistent with the hypothesis that there is a linear relationship
between alfalfa yield and depth of irrigation water. This
equation fails to account, adequately, for the observed
variation between arrays.
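The arithmetic of this test may be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fisher_z` is our own, the two mean squares are the rounded figures of Table 121, and the 1 per cent value .6047 is simply copied from the text rather than computed.

```python
import math

def fisher_z(variance_tested, variance_yardstick):
    """Half the difference between the natural logarithms of two variances."""
    return 0.5 * (math.log(variance_tested) - math.log(variance_yardstick))

# Degrees of freedom to deviate from the fitted line:
# number of points less number of constants in the equation.
points, constants = 8, 2
df_deviation = points - constants            # 6

# Mean squares as printed in Table 121.
var_deviation = 7.47   # Component B2, 44.79 / 6, rounded
var_within = 2.12      # Component A, 36 degrees of freedom

z_linear = fisher_z(var_deviation, var_within)

# 1 per cent value quoted in the text for n1 = 6, n2 = 36 (not computed here).
Z_ONE_PER_CENT = 0.6047
reject_linear = z_linear > Z_ONE_PER_CENT

print(f"df = {df_deviation}, z = {z_linear:.4f}, reject linear hypothesis: {reject_linear}")
```

The value obtained differs from .6298 only in the fourth decimal place, because the text works from logarithms already rounded to four places.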

TESTING THE HYPOTHESIS OF A CURVILINEAR RELATIONSHIP

We may now test the hypothesis that a power curve
of the second degree (Y = a + bX + cX^2) defines the relation
between alfalfa yield and depth of irrigation water
applied. The procedure is identical with that followed
in the case of the straight line. By the method of least
squares we determine the best values of the constants in
an equation of the desired form. The curve is fitted to
the means of the eight arrays, each weighted by the number
of observations in that array. The derived equation is
Y = 3.539 + .2527X - .002827X^2. The curve appears
graphically in Fig. 82, and the computation of the sum
of the squared deviations from it is shown in Table 122.

Table 122

Alfalfa Yield and Depth of Irrigation Water
(Class means and values based on a power curve of the second degree)

  (1)        (2)       (3)          (4)           (5)               (6)       (7)
Inches     No. of    Mean yield   Estimated     Difference
of water   obser-    of class     yield, from   between mean
(class)    vations   (tons)       equation      yield of class
                                  (tons)        and estimated
                                                yield, (3) - (4)
             f                                    d                 d^2       fd^2

   0         6        3.88         3.54          + .34             .1156     .6936
  12         6        5.63         6.16          - .53             .2809    1.6854
  18         4        6.80         7.17          - .37             .1369     .5476
  24         6        7.92         7.98          - .06             .0036     .0216
  30         6        8.98         8.58          + .40             .1600     .9600
  36         6        9.27         8.97          + .30             .0900     .5400
  48         6        9.02         9.16          - .14             .0196     .1176
  60         4        8.42         8.52          - .10             .0100     .0400

                                                                   Total    4.6058

The inadequacy of the fit is measured this time by the
figure 4.61, the sum of the squared deviations from the
power curve of the second degree. This sum, to which we
may refer as B4, is a component of B, the variation between
arrays. It is the portion that is not accounted for by the
hypothesis of a curvilinear relationship, of the type assumed,
between alfalfa yield and irrigation water applied. The
other component of B is derived by the method indicated
in Table 123 on page 516.
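The entries of Table 122 may be reproduced from the fitted equation. The sketch below is illustrative only; the variable names are our own, and estimated yields and differences are rounded to two places because the table itself works with two-place figures.

```python
# Class data from Table 122: depth of water (inches), observations, mean yield (tons).
depth = [0, 12, 18, 24, 30, 36, 48, 60]
n_obs = [6, 6, 4, 6, 6, 6, 6, 4]
mean_yield = [3.88, 5.63, 6.80, 7.92, 8.98, 9.27, 9.02, 8.42]

def estimated_yield(x):
    """Second-degree curve given in the text: Y = 3.539 + .2527X - .002827X^2."""
    return 3.539 + 0.2527 * x - 0.002827 * x ** 2

rows = []
for x, f, y in zip(depth, n_obs, mean_yield):
    y_c = round(estimated_yield(x), 2)   # col. (4), rounded as in the table
    d = round(y - y_c, 2)                # col. (5)
    rows.append((x, f, y, y_c, d, d * d, f * d * d))

# Component B4: weighted sum of squared deviations from the curve, col. (7).
b4 = sum(row[-1] for row in rows)

for row in rows:
    print("%3d %2d %5.2f %5.2f %+5.2f %7.4f %7.4f" % row)
print("B4 =", round(b4, 4))
```

Weighting each squared difference by the number of observations in the class is what makes this sum comparable with the variation within arrays, which is built up from the 44 individual observations.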

Table 123

Computation of Variation in Alfalfa Yield Attributable to Irrigation
Differences on the Hypothesis of a Non-Linear Regression

  (1)        (2)       (3)             (4)           (5)          (6)        (7)
Inches     No. of    Estimated       Mean yield,   (3) - (4)
of water   obser-    yield,          all obser-
           vations   equation of     vations
                     second degree
             f                                       d            d^2        fd^2

   0         6        3.54            7.48         - 3.94       15.5236    93.1416
  12         6        6.16            7.48         - 1.32        1.7424    10.4544
  18         4        7.17            7.48         -  .31         .0961      .3844
  24         6        7.98            7.48         +  .50         .2500     1.5000
  30         6        8.58            7.48         + 1.10        1.2100     7.2600
  36         6        8.97            7.48         + 1.49        2.2201    13.3206
  48         6        9.16            7.48         + 1.68        2.8224    16.9344
  60         4        8.52            7.48         + 1.04        1.0816     4.3264

                                                                 Total    147.3218

We may designate by B3 the sum 147.32. This is the
component of the variation between arrays that is accounted
for by the hypothesis of a relationship defined by a second
degree curve. The items in col. (3) of Table 123 differ
from the mean of all the observations, on our present
assumption, because alfalfa yield varies with increased
applications of water in a manner defined by the equation

    Y = 3.539 + .2527X - .002827X^2.

We have again broken B, the total variation between
arrays, into two components, B3 representing the influence
of the irrigation factor, working in accordance with a definite
law, and B4 representing random factors, or random factors
combined with the irrigation factor. (The irrigation factor
enters into B4 to the extent that the hypothesis in question
fails to take account of the true relation between alfalfa
yield and depth of water applied.) This is, of course, a
different division of B from that resulting from the application
of a linear hypothesis. The present division may be
set down in summary.


Nature of variability                      No. of degrees   Sum of     Mean
                                           of freedom       squares    square

Between arrays, due to regression of
  second degree (Component B3)                  2           147.32
Deviations from second degree curve
  of regression (Component B4)                  5             4.61      .92
Total variation between arrays
  (Component B)                                 7           151.93

The seven degrees of freedom entering into component B
are now divided, five to component B4 and two to component
B3. The reasons for this allocation of the degrees of
freedom are similar to those presented in discussing the linear
hypothesis. As regards component B4, the item now of
chief concern to us, it is clear that when a curve defined
by an equation with three constants is fitted to eight
points there are five degrees of freedom to deviate from
that curve.

Dividing 4.61 by 5 we secure .92, the value of the variance
comparable to the variance of component A. For
again we must use a criterion based on A, in determining
the limits within which variation due to random factors,
independent of irrigation, may play. We come again to a
comparison of variances.


Table 124

A Test of the Hypothesis of Curvilinear Relationship

Nature of variability                 Degrees of   Mean square   Natural logarithm
                                      freedom      (variance)    of mean square
                                      n            \sigma^2      \log_e \sigma^2

Within arrays (Component A)           36           2.12            .7514
Deviation from second degree curve
  of regression (Component B4)         5            .92          - .0834

                                                   Difference =  - .8348
                                                            z =  - .4174


In this case the degree of deviation from the curve of
regression defined by the power curve of the second degree
is actually less than the deviation within arrays, which
serves as our yardstick. The value of z is therefore negative,
equal to -.4174. This measure may be tested for significance
by the methods previously discussed. The z-table
is entered with n1 = 36 (the number corresponding to the
larger of the two variances), n2 = 5. Interpolating in the
table for these values we obtain 1.1158 as the 1 per cent
value of z. The present value is distinctly less than this.
The difference between the two measures of variance is
not significant. The departures from the curve of regression
may be attributed to “chance,” that is, to random factors
independent of the irrigation factor.

In following this general procedure it is necessary to test
different hypotheses (i.e., different functions) only until the
difference between the variance defined by component A
and the variance defining departures from the curve of
regression be small enough to be attributed to the play
of chance. Thus, if a P of .05 constitutes our standard,
the difference between the two variances given in the preceding
table, as measured by z, might be positive and as
great as .4536, without leading to rejection of the hypothesis
being tested. It could be as great as .6370 if our standard
of significance were a P of .01.¹ A rather exceptionally
close fit by the second degree curve we have employed gives
us the negative value of z we have actually obtained.
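The rule of trying successive functions only until the deviations fall within the range of chance may be sketched as a short loop. The sketch is an illustration, not a general routine: the list `hypotheses` and the function `first_acceptable` are names of our own, the sums of squares are those of the alfalfa example, and the 1 per cent values of z (.6047 and .6370) are copied from the text rather than computed.

```python
import math

N_POINTS = 8            # array means to which each curve is fitted
WITHIN_VARIANCE = 2.12  # Component A, 36 degrees of freedom

# (description, sum of squared deviations, constants in equation, 1 per cent z)
hypotheses = [
    ("straight line", 44.79, 2, 0.6047),
    ("second-degree curve", 4.61, 3, 0.6370),
]

def first_acceptable(candidates):
    """Return the first (simplest) hypothesis whose z does not exceed its limit."""
    for name, sum_sq, n_constants, z_limit in candidates:
        df = N_POINTS - n_constants
        z = 0.5 * math.log((sum_sq / df) / WITHIN_VARIANCE)
        if z <= z_limit:
            return name, df, z
    return None

accepted = first_acceptable(hypotheses)
print(accepted)
```

Because the candidates are listed from the simplest function upward, the loop stops at the simplest equation that the observations do not contradict. The z it reports for the second-degree curve is close to, but not identical with, -.4174, since the text divides and takes logarithms with rounded figures.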

We have arrived, then, at an hypothesis concerning the
relation between alfalfa yield and depth of irrigation water
applied, with which observed facts are not inconsistent.
Our observations, be it noted, do not establish the truth
of this hypothesis. Other hypotheses might be equally tenable,
and perhaps even more closely in accord with the facts.²

1 These figures are derived from Tables VI and VII by the process of
interpolation described above, with n1 = 5 and n2 = 36. (n1 is taken as equal
to 5, of course, only when the variance of B4 is greater than that of A; n1 is
always taken to represent the number of degrees of freedom corresponding to
the larger of the two variances being compared.) This method of interpolation
is applicable over the range of the z-table, except for the corner relating to
values of n1 in excess of 24 and values of n2 in excess of 30. For dealing with
cases in this region, R. A. Fisher gives the following formulas for
approximating the desired quantities:

    5 per cent value of z = \frac{1.6449}{\sqrt{h - 1}} - .7843 \left( \frac{1}{n_1} - \frac{1}{n_2} \right)

    1 per cent value of z = \frac{2.3263}{\sqrt{h - 1}} - 1.235 \left( \frac{1}{n_1} - \frac{1}{n_2} \right)

In these formulas h is the harmonic mean of n1 and n2. That is

    \frac{1}{h} = \frac{1}{2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right).
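The two approximations may be written out as short functions. This is a sketch only: the function names are our own, the constants are those quoted above, and the formulas are intended solely for the corner of the table described (n1 above 24 and n2 above 30).

```python
import math

def harmonic_mean_df(n1, n2):
    """h such that 1/h = (1/n1 + 1/n2) / 2."""
    return 2.0 / (1.0 / n1 + 1.0 / n2)

def z_five_per_cent(n1, n2):
    """Approximate 5 per cent value of z for large n1 and n2."""
    h = harmonic_mean_df(n1, n2)
    return 1.6449 / math.sqrt(h - 1) - 0.7843 * (1.0 / n1 - 1.0 / n2)

def z_one_per_cent(n1, n2):
    """Approximate 1 per cent value of z for large n1 and n2."""
    h = harmonic_mean_df(n1, n2)
    return 2.3263 / math.sqrt(h - 1) - 1.235 * (1.0 / n1 - 1.0 / n2)

# Hypothetical degrees of freedom lying inside the region where the formulas apply.
n1, n2 = 30, 60
print(harmonic_mean_df(n1, n2), z_five_per_cent(n1, n2), z_one_per_cent(n1, n2))
```

When n1 and n2 are equal the second term vanishes, and the limiting value of z depends only on the common number of degrees of freedom.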

2 We could, of course, fit a curve of still higher degree, the equation to which

All that we can say is that the observed facts do not disprove
the hypothesis. If the hypothesis is tenable on rational
grounds, we have reached a conclusion upon which we may
rest, for the time.

SUMMARY: VARIANCE ANALYSIS IN THE MEASURE OF RELATIONSHIP

The procedure employed in the last example may be 
summarized and certain measurements presented which 
show the relation of this procedure to methods discussed in 
preceding chapters. The quantitative results are presented 
in Table 125. 


Table 125

Component Elements of the Variability of Alfalfa Yield, and Various
Measures of Correlation

Total variability of observations relating        Test of             Measure of
to alfalfa yield, and components of this total    significance        correlation

Total variability (sum of squared
  deviations from mean)               228.33

1. Division of total variability into:
   A. Variation unrelated to irrigation
      factor (i.e., variation within
      arrays)                          76.39
   B. Variation attributable to                   z = 1.1632          Correlation ratio
      irrigation factor, and to other             1 per cent value    \eta^2 = 151.94 / 228.33
      causes, in indeterminate                    of z = .6780               = .6654
      proportions (i.e., variation                                    \eta = .82
      between arrays)                 151.94

(Footnote 2 continued from page 518.)

contained four constants, or more, instead of the three constants in the
equation actually employed. The deviations from this curve of higher degree
would be smaller than from the curve of second degree, and z would be
correspondingly smaller. It is a principle of scientific procedure, however,
to employ the simplest acceptable function. Needless complexities, whether in
the form of unnecessary assumptions or of unnecessary constants in an equation
of relationship, are rigorously avoided.

Table 125 (Continued)

Component Elements of the Variability of Alfalfa Yield, and Various
Measures of Correlation

Total variability of observations relating        Test of             Measure of
to alfalfa yield, and components of this total    significance        correlation

2. Division of component B of (1) above into:
   B1. Variation attributable to                                      Coefficient of
       irrigation factor on the                                       correlation
       assumption of a linear                                         r^2 = 107.15 / 228.33
       relationship                   107.15                              = .4693
                                                                      r = .69
   B2. Variation attributable to                  z = .6298
       irrigation factor, but not                 1 per cent value
       explainable in terms of a                  of z = .6047
       linear relationship, and to
       other causes, in indeterminate
       proportions                     44.79

3. Division of component B of (1) above into:
   B3. Variation attributable to                                      Index of correlation
       irrigation factor on the                                       \rho^2 = 147.32 / 228.33
       assumption of a relationship                                          = .6452
       defined by power curve of                                      \rho = .80
       second degree                  147.32
   B4. Variation attributable to                  z = - .4174
       irrigation factor, but not                 1 per cent value
       explainable in terms of power              of z = 1.1158
       curve of second degree, and to
       other causes, in indeterminate
       proportions                      4.61

The meaning of this summary should be clear, with
reference to the preceding demonstration. Component A
of the total variability, being independent of the influence
of the irrigation factor, is the yardstick, or standard of
reference, which is used in all the tests of significance noted
in the second column. Component B, in the first test,
is shown to be clearly greater than A, when account is
taken of the number of degrees of freedom present in the
two quantities. Thereafter, component B is broken into
sub-components, first on the hypothesis that alfalfa yield
and irrigation are related by a linear function, next on
the hypothesis that the relationship is defined by a power
curve of the second degree. The evidence is not consistent
with the first of these hypotheses, and it is rejected. (The
hypothesis would be rejected on rational grounds, as well
as on the basis of empirical evidence.) The results are not
inconsistent with the second hypothesis, and we may accept
it, subject to the possibility of modification on the basis
of later experience.

Three abstract measures of degree of correlation between 
alfalfa yield and applications of irrigation water are given 
in the right-hand column. All of these may be derived 
directly from the quantities employed in the variance 
analysis. Study of the elements of these correlation meas- 
ures, and of the relation of the several measures to the 
corresponding hypothesis, will provide a suggestive review 
of the general problem of correlation. 
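The derivation of the three measures from the quantities of the variance analysis may be shown in a brief sketch. The dictionary keys and variable names below are our own; the sums of squares are those of Table 125, and each measure is taken as the square root of the ratio of an explained component to the total variability.

```python
import math

TOTAL_SS = 228.33   # total variability of alfalfa yield

# Explained component corresponding to each measure (from Table 125).
explained = {
    "correlation ratio (eta)": 151.94,         # B,  variation between arrays
    "coefficient of correlation (r)": 107.15,  # B1, linear hypothesis
    "index of correlation (rho)": 147.32,      # B3, second-degree curve
}

measures = {}
for name, ss in explained.items():
    squared = ss / TOTAL_SS                   # proportion of variability accounted for
    measures[name] = (squared, math.sqrt(squared))

for name, (squared, value) in measures.items():
    print(f"{name}: squared = {squared:.4f}, value = {value:.2f}")
```

The ordering of the three values reflects the ordering of the hypotheses: the correlation ratio, which imposes no functional form, sets the upper limit; the second-degree curve comes close to it; the straight line falls well short.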

We should note here that an assumption of normality
is implied in the comparison of standard deviations, or
of variances, in this type of analysis. Minor departures
from normality do not materially affect the procedure, but
substantial departures do so. The conversion to other
forms (such as logarithms or reciprocals) of observations
not normally distributed in natural terms will sometimes
yield normal distributions. Where this is possible, the
precision of the method of variance analysis is increased
by such conversion. Limitations arising out of material
departures from normality may be avoided, also, by the
use of ranks, as is done in the computation of the coefficient
of rank correlation. Appropriate procedures have been
developed by Milton Friedman.¹

1 “Use of Ranks to Avoid the Assumption of Normality Implicit in the

Variance Analysis in Testing the Significance of
Seasonal Fluctuations

The methods outlined in this chapter are applicable to
certain of the problems encountered in the analysis of time
series. They are peculiarly appropriate in determining
whether the seasonal fluctuations observed in a given series
represent a true seasonal pattern. Apparent seasonal movements
would be present in any series of observations covering
a period of years, by months. Chance factors would create
some differences between averages of all the Januaries, all
the Februaries, etc., even though no true seasonal movement
existed. We require an objective test, to be used in determining
whether the differences among such monthly averages
are significant or not.

The entries in the body of Table 126 are the figures
obtained when freight car loadings by months, for the
period 1918-1927, are expressed as percentages of linear
trend values. (The original data are given in Chapter VIII.)
The arithmetic mean of the ten items for January appears
in the bottom row, with similar means for the other months.¹
The test for seasonality involves answering the question:
Do these means differ significantly from 99.9867, the average
value of the 120 items in the table? In seeking to answer
this question we must break the total variance of the freight
car loadings data into its elements. We wish to define that
portion of the total variance apparently due to seasonal
movements. This may then be appraised with reference to a
yardstick representing what we may call the residual
variability of the series.

Analysis of Variance,” Journal of the American Statistical Association, Vol. 32,
Dec. 1937, 675-701.

1 These means, it may be noted, are not precisely the same as the seasonal
indices given in Chapter VIII. In seeking to improve the representativeness
of the monthly indices, only the four central items for each month were used in
the averaging process employed in that chapter. Here it is necessary to employ
the arithmetic mean of all the items for each month.

Table 126

Freight Car Loadings in the United States, 1918-1927
Monthly Values as Percentages of Trend, with Computations Required in the
Analysis of Variance

[The body of the table (monthly values by years) is not reproduced here.
Surviving entries: col. (16), sum of squares, by years: 141,082.37;
118,010.96; 129,076.68; 92,906.98; 106,628.93; 132,897.76; 119,713.90;
126,677.28; 129,713.61; 116,842.48; total, 1,213,250.14. Col. (15), yearly
means: 108.092 (1918), 103.267 (1920), 87.633 (1921).]

In computing the total variance we may make use of the
familiar relation d = d' - c, where d is the deviation
of an observation from the true mean, d' is the deviation
from an assumed mean, and c is the difference between true
and assumed means. In this case we take the assumed
mean at 0 on the original scale, and c is thus equal to the
mean. Since we wish to work with sums of squared values,
we use the relationship

    \sum d^2 = \sum (d')^2 - Nc^2.

(The mean should be computed to more places than are to
be finally retained, since the process of squaring and
multiplying by N greatly magnifies even slight errors.)

The entries in col. (16) of Table 126 are the sums of
the squares of the items in the body of the table. Inserting
the proper values in the above formula, we have

    \sum d^2 = 1,213,250.14 - 120 × 99.9867^2
             = 1,213,250.14 - 1,199,680.82
             = 13,569.32.
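The short-cut relation may be checked numerically. In the sketch below the function name and the four-item sample are hypothetical, introduced only to show that the short method agrees with direct summation; the final line applies the same function to the totals quoted in the text.

```python
def sum_sq_deviations(sum_of_squares, n, mean):
    """Sum of squared deviations from the mean, from raw sums of squares.

    Takes the assumed mean at 0, so that c equals the mean itself.
    """
    return sum_of_squares - n * mean ** 2

# Hypothetical small sample, used only to verify the identity.
sample = [98.0, 103.5, 91.2, 107.3]
mean = sum(sample) / len(sample)
direct = sum((x - mean) ** 2 for x in sample)
shortcut = sum_sq_deviations(sum(x * x for x in sample), len(sample), mean)

# Totals quoted in the text for the freight car loadings data.
total_ss = sum_sq_deviations(1_213_250.14, 120, 99.9867)

print(round(direct, 6), round(shortcut, 6), round(total_ss, 2))
```

Since the two large terms nearly cancel, an error in the fourth decimal of the mean is multiplied many times over in the result, which is the reason for carrying the mean to extra places.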

As in the alfalfa problem discussed above, this total
may be broken into an element representing variance
between the monthly means and an element representing
variance within the several months. (Reference here is to
the columns of Table 126.) The variance between months
may be computed directly from the monthly means.

Thus:

Sum of squares of deviations of monthly means from grand mean
    = 10 × (99.9867 - 90.60)^2 + 10 × (99.9867 - 92.04)^2
      + 10 × (99.9867 - 96.38)^2 + . . .
      + 10 × (99.9867 - 88.35)^2.

That is, the deviation of each monthly mean from the
grand mean is squared and weighted by the number of
items represented by that mean; the sum of the twelve
figures thus obtained is the required measure of variability
between months.

An alternative shorter method may be employed in
determining the variance between months, utilizing the
relationship

    \sum d^2 = \sum (d')^2 - Nc^2.

Here each d' is the mean value for a given month. Each
squared value must be weighted by the number of items
represented by the mean. Thus

    \sum (d')^2 = 10(90.60)^2 + 10(92.04)^2 + 10(96.38)^2 + . . .
                  + 10(88.35)^2
                = 1,207,068.40.

The correction factor, Nc^2, is the same as in the first
operation. We have, then,

Sum of squares of deviations of monthly means from
grand mean = 1,207,068.40 - 1,199,680.82
           = 7,387.58.

This sum measures that portion of the total variability
that may be attributed to seasonal fluctuations. Is it
significant or does it merely reflect the play of the mass
of undifferentiated factors we call chance?

In answering a similar question concerning the alfalfa
problem we used as a yardstick the variability independent
of the one factor the effects of which were being studied,
namely, irrigation differences. In the present case we could
obtain a measure that is independent of seasonality by
computing the variability within the several columns of
Table 126. That is, each item in col. (2) could be deducted
from the January mean, 90.60, and the sum of the squared
deviations in this column obtained; a similar sum could
be obtained for each of the other columns numbered from
(3) to (13). The grand sum of these figures would be the
variability within arrays, a variability clearly independent
of seasonal forces since only differences among items for
the same month enter into it. This sum, measuring
variability within columns, has a value of 6,181.74. The
variability between columns plus the variability within
columns is, of course, equal to the total. That is,
7,387.58 + 6,181.74 = 13,569.32.

The measure of variability within columns will not serve
in the present instance, however. The yardstick should be
a measure of the variability due to “chance,” that is, to the
play of a mass of random factors which may not be observed
and measured individually. Effects that can be clearly
attributed to specific causal forces should not be included
in the yardstick. But some of the variability within months
may be clearly assigned to changes associated with the
classification by years. The average of the 12 monthly
items for 1918 is 108.09; that for the 12 months in 1921
is 87.63. The former was a year of prosperity, the latter
one of depression. Clearly, some of the differences among
the items in the January column, or in the May column,
are definitely attributable to cyclical forces that raise all
the monthly figures for one year and depress all the monthly
figures for another year. (The influence of trend is not
present, since the items in the body of the table are actual
values expressed as percentages of trend.) The variability
within months should be corrected by the subtraction of
that portion of it that may be attributed to factors affecting
yearly conditions as a whole.

The influence of cyclical and other forces affecting whole
years is measured by differences between the averages for
1918, 1919, 1920, and the other years covered. These
averages are given in col. (15) of Table 126. The desired
quantity may be obtained by the precise methods used in
measuring the variability between months. We have

    \sum d^2 = \sum (d')^2 - Nc^2

or,

Sum of squares of deviations of yearly averages from grand mean
    = (12 × 108.092^2 + 12 × 98.542^2 + 12 × 103.267^2 + . . .
       + 12 × 98.383^2) - 120 × 99.9867^2
    = 1,203,537.88 - 1,199,680.82
    = 3,857.06.

(There will, of course, be ten squared items within parentheses,
one for each of the ten years covered by the data.)
Subtracting 3,857.06 from 6,181.74, the measure of total
variability within the columns, we have 2,324.68 as the
balance. This is the desired yardstick. It measures that
portion of the variability among the original items which
is clearly independent of the seasonal influence. Secondly,
it has been corrected by the subtraction of that portion
of the variability within months which is attributable to
cyclical and other factors responsible for broad changes
from year to year. The final balance represents the play
of forces independent both of seasonal movements and of
broad swings affecting each yearly value as a whole. This
residual variability, measured by the figure 2,324.68, reflects
the play of all those random, undifferentiated factors we
lump together as chance.¹

This residual variability may be most readily computed by
subtracting from the total variability the two figures measuring,
respectively, variability between the means of the months
and variability between the means of the years. At this stage
of the computation these figures will be in the form of sums
of squared deviations. The form of organization employed in
Table 126 on page 528 is convenient for these calculations.
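The subtraction just described may be illustrated on a small table. The figures below are hypothetical (three "years" by four "months"), chosen only to show that the residual obtained by subtraction is the same quantity as the sum of squared deviations remaining after both the row mean and the column mean have been allowed for.

```python
# Hypothetical percentages of trend: rows are years, columns are months.
table = [
    [95.0, 102.0, 108.0, 99.0],
    [90.0, 97.0, 101.0, 92.0],
    [101.0, 106.0, 113.0, 104.0],
]
n_rows, n_cols = len(table), len(table[0])

grand_mean = sum(sum(row) for row in table) / (n_rows * n_cols)
row_means = [sum(row) / n_cols for row in table]
col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]

# Sums of squared deviations, each mean weighted by the items it represents.
total = sum((x - grand_mean) ** 2 for row in table for x in row)
between_rows = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
between_cols = n_rows * sum((m - grand_mean) ** 2 for m in col_means)

residual_by_subtraction = total - between_rows - between_cols

# The same residual computed item by item.
residual_direct = sum(
    (table[i][j] - row_means[i] - col_means[j] + grand_mean) ** 2
    for i in range(n_rows)
    for j in range(n_cols)
)

print(round(residual_by_subtraction, 6), round(residual_direct, 6))
```

The agreement of the two figures holds whenever every cell of the table contains one item, as in Table 126, where each of the 120 cells holds a single monthly value.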

In the application of the test of significance, account 
must be taken of the number of degrees of freedom entering 
into each of these measures of variability. Table 127 indi- 
cates a suitable procedure. 

1 When, as in the present example, the influences of the two variables, or
principles of classification, are independent, it is valid to use the residual
variability thus computed as a measure of the strength of random factors. If
these influences are not independent (if, in terms of the above example,
seasonal movements affecting the monthly averages and cyclical movements
affecting the yearly averages should be correlated), the residual quantity will
not be an accurate measure of truly random factors. When the residual quantity
which is used as the yardstick in variance analysis is derived from observations
that are alike in respect of both principles of classification (i.e., when
the quantity measures variance within cells obtained by the application of a
two-fold principle of classification) this difficulty does not arise. An example
of this type is given in Appendix E.

Table 127

Analysis of Variance of Freight Car Loadings and Test of Seasonality

(1)                  (2)              (3)          (4)             (5)
Nature of            No. of degrees   Sum of       Mean square     Natural logarithm
variability          of freedom       squares      (variance)      of mean square
                     (n)                           \sigma^2        \log_e \sigma^2
                                                   (3) ÷ (2)

Between means
  of years              9             3,857.06
Between means
  of months            11             7,387.58     671.598         6.50970
Residual
  variability          99             2,324.68      23.482         3.15627
Total                 119            13,569.32

                                                   Difference =    3.35343

                                          z = 3.35343 / 2 = 1.67671

The item 3,857.06 measures the degree of difference
between 10 yearly averages. Nine degrees of freedom are
represented in this figure. (The use of weights in computing
the sum of the squares does not affect the number of degrees
of freedom.) Similarly, 11 degrees of freedom are represented
in the measure of variability between the 12 monthly
means. The total variability is computed from 120 items,
so there are 119 degrees of freedom in all. The number
of degrees of freedom in the residual variability is, therefore,
119 - (11 + 9), or 99.

The variance between the means of months (i.e., the
mean square) is 671.598. The residual variance is 23.482.
The test of seasonality reduces to the question: May the
variance between the means of months be attributed to
the random forces responsible for the residual variance?
Unless the variance between the monthly means is significantly
greater than the residual variance, no significance
may be attached to the observed differences between the
averages for January, February, March, and the other
months.

The test is applied with reference to the measure z,
which is equal to half the difference between the natural
logarithms of the two variances being compared. From
the entries in Table 127 we compute z as equal to 1.67671.
Referring to Appendix Table VI we find that for n1 = 11
and n2 = 99, the 1 per cent value of z is approximately .44.
The present value is distinctly greater than this. The
results are not consistent with the hypothesis that the
true value of z is zero. There is clear evidence of the existence
of a definite seasonal pattern in freight car loadings.¹
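The construction of Table 127 and the test of seasonality may be retraced as follows. The sketch assumes only the three sums of squares computed above and the approximate 1 per cent value of .44 quoted from the z-table; the dictionary layout and variable names are our own.

```python
import math

# Sums of squared deviations from the text.
sum_sq = {"years": 3857.06, "months": 7387.58, "total": 13569.32}

# Degrees of freedom: one less than the number of means (or items) compared.
deg_free = {"years": 10 - 1, "months": 12 - 1, "total": 120 - 1}

# The residual line is obtained by subtraction in both columns.
sum_sq["residual"] = sum_sq["total"] - sum_sq["years"] - sum_sq["months"]
deg_free["residual"] = deg_free["total"] - deg_free["years"] - deg_free["months"]

mean_sq = {k: sum_sq[k] / deg_free[k] for k in ("months", "residual")}

# z: half the difference between the natural logarithms of the two variances.
z_months = 0.5 * (math.log(mean_sq["months"]) - math.log(mean_sq["residual"]))

Z_ONE_PER_CENT = 0.44   # approximate value quoted in the text for n1 = 11, n2 = 99
seasonal_pattern = z_months > Z_ONE_PER_CENT

print(deg_free["residual"], round(sum_sq["residual"], 2),
      round(mean_sq["months"], 3), round(mean_sq["residual"], 3),
      round(z_months, 4), seasonal_pattern)
```

Keeping the sums of squares and the degrees of freedom in parallel makes it plain that both are partitioned in the same way, the residual line being what remains of each after years and months have been removed.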

The same yardstick may be applied in testing whether
the differences between the yearly averages are significant.
The rather wide variations from year to year in the average
values of the items in the body of Table 126 represent,
presumably, the play of cyclical forces plus major “accidental
fluctuations” affecting yearly totals. (The trend
factor, had it not been eliminated, would have combined
with these other two to create differences among the yearly
totals.) But are these year-to-year differences great enough
to be attributed to definite forces other than the chance
factors represented in the residual variance we are using
as yardstick?

The variance between means of years is equal to
3,857.06 ÷ 9, or 428.562. Is this significantly greater than
23.482, the residual variance? Following the procedure
illustrated in Table 127 we obtain 1.35352 as the value
of z. The 1 per cent value of z, for n1 = 9, n2 = 99, is
approximately .47. The test indicates that the differences
between the annual averages are due to definite forces
other than the random factors represented in the residual
variance.

1 In the test here applied we are proceeding on the assumption that the
seasonal pattern is constant from year to year. If it is not constant, the
accuracy of the residual variability, as a measure of “chance” factors, and of
the measure of variability between months will be affected, and the significance
of the results will be lessened. If there is reason to believe that seasonal
movements have changed over the period covered, tests of the kind suggested
in Chapter VIII should precede the tests here discussed.

THE MEASUREMENT OF RELATIONSHIP:
MULTIPLE AND PARTIAL CORRELATION

In dealing with methods of defining correlation in the
preceding chapters we have been concerned with problems
involving only two variables, a dependent variable and a
single independent variable. We have found, in certain
cases, a fairly high degree of correlation between the two
variables studied. But it is obvious that, in general,
economic phenomena are affected by more than one factor,
that the fluctuations in a single variable may be due to
the interaction of many forces. In dealing with just two
variables all other factors are ignored, on the assumption,
usually, that in the single independent variable are found
the most important causes¹ of fluctuations in the dependent
variable. Thus, in the alfalfa example given, the effect
upon yield of but a single factor, irrigation, was studied.
Yet variations in rainfall and temperature must have
affected the yield in the different years studied. Similarly,
variations in practically every factor dealt with in economic
analysis are traceable to more than one cause. If our
analysis is to be complete we must employ methods which
will enable more than two variables to be handled at a
time. We need instruments that will assist us in measuring
the combined effect upon a single variable of a number
of factors. Such instruments may be secured by a simple
extension of methods already familiar.

In Table 128 are presented figures showing the yield
of corn, per acre, in Kansas from 1890 to 1933, together
with the average June, July, and August temperatures for
each of these years.

1 This should not be taken to mean that the coefficient of correlation
measures or establishes causal relationships.

Table 128

Corn Yield and Temperature in Kansas, 1890-1933 ¹

          Average yield   Average       Average       Average
          per acre, in    June          July          August
Year      bushels         temperature   temperature   temperature
          X1              X2            X3            X4

1890      15.6            77.6          83.1          76.1
1891      26.7            70.7          74.0          75.1
1892      24.5            73.4          77.5          76.5
1893      21.3            74.7          79.5          73.8
1894      11.2            74.2          77.8          78.0
1895      24.3            71.7          74.9          76.0
1896      28.0            74.1          78.1          78.7
1897      18.0            76.6          80.2          76.0
1898      16.0            76.0          77.7          78.2
1899      27.0            78.9          76.2          80.6
1900      19.0            74.9          77.9          81.0
1901       7.8            77.3          85.0          79.1
1902      29.9            70.9          76.8          78.2
1903      25.6            67.2          78.3          75.3
1904      20.9            70.4          75.6          74.6
1905      27.7            75.5          74.6          78.7
1906      28.9            71.8          73.8          76.3
1907      22.1            72.0          78.4          78.1
1908      22.0            72.1          75.8          76.2
1909      19.9            73.1          78.1          80.1
1910      19.0            72.2          79.5          75.7
1911      14.5            80.5          78.6          76.4
1912      23.0            69.3          79.9          77.4
1913       3.2            74.2          82.1          84.2
1914      18.5            78.2          79.9          78.2
1915      31.0            69.2          74.0          70.1
1916      10.0            70.3          81.2          79.6
1917      13.0            72.8          80.8          73.4
1918       7.1            78.4          78.3          82.3
1919      15.2            72.3          80.2          78.3
1920      26.5            72.8          77.6          72.9
1921      22.2            74.4          79.2          78.6
1922      19.3            75.2          77.0          80.1
1923      21.7            73.3          79.4          78.3
1924      21.7            74.3          75.1          79.0
1925      16.6            77.7          79.7          77.4
1926      11.0            72.5          78.4          79.1
1927      30.0            70.9          76.9          73.1
1928      27.0            67.7          78.1          77.1
1929      17.5            72.2          78.8          78.9
1930      12.0            73.1          81.7          80.3
1931      17.5            78.1          80.6          76.1
1932      18.5            74.3          81.8          79.2
1933      11.5            80.5          81.4          76.8

1 The data of corn yield are from Bulletin 515, U. S. D. A., and from the
Yearbooks of the U. S. D. A. Temperature data are from reports of the U. S.
Weather Bureau for Dodge City, Concordia and Iola.

The Relation between Corn Yield and Temperature: 
Preliminary Analysis 

It is known that corn yield is affected by the temperature 
during the growing season. The object of the present 
study is the determination of the precise relation between 
yield and temperature during each of the three months 
given, in order to secure a basis for estimating the yield 
from a knowledge of the temperature. As certain growing 
months are more important than others, the relation 
between temperature and yield may be determined, first, 
for each of the three months separately. 

The equation which describes the relationship between 
yield per acre and June temperature will be of the type 

$$X_1 = a + b_{12} X_2.$$

The equation describing the relationship between yield per 
acre and July temperature will be of the type 

$$X_1 = a + b_{13} X_3.$$

(In each case X1 represents average corn yield per acre, 
for the State, while X2, X3, etc., represent the absolute 
temperature, in degrees Fahrenheit.) Instead of the symbols 
Y and X, used to represent the variables in the 
preceding examples, X1, X2, X3, etc., are employed, X1 
representing in this case the dependent variable. The 
symbol for the constant representing the slope (the coefficient 
of regression) is, in the first instance above, b12. The 
subscripts 1 and 2 indicate the variables to which this 
constant refers, the first subscript always representing the 
dependent variable (X1 in the example cited), the second 
the independent variable (X2 in the illustration above). 
These subscripts are necessary to distinguish the different 
constants when several variables enter into the problem. 
The meaning is precisely the same as in the former examples 
when no subscripts were needed because only two variables 
were dealt with. 

Solving the proper normal equations for the constants 
in the equation which describes the average relationship 
between yield per acre and June temperature, we have 

$$X_1 = 100.35 - 1.096 X_2.$$

The value of S12 may be determined from the formula 

$$S_{12}^2 = \frac{\Sigma(X_1^2) - a\,\Sigma(X_1) - b_{12}\,\Sigma(X_1 X_2)}{N}.$$

(The subscripts to S, and those to r which appear below, 
have the same meaning as those employed in the preceding 
paragraph.) Substituting the given values, and solving, 
we have 

$$S_{12}^2 = 33.593$$

and 

$$S_{12} = 5.80.$$

The significance of the standard error, S, as a measure 
of the reliability of estimates based upon the equation of 
relationship, has been fully explained. In judging of the 
usefulness of the equation, S should be compared with σ1 
(the standard deviation of X1) which may be looked upon 
as a measure of the reliability of estimates based upon the 
arithmetic mean of the variable X1. For this we have 

$$\sigma_1 = 6.68.$$

Clearly, the estimates from the equation are more reliable 
than those based upon the mean. The coefficient of correlation, 
r, expresses this relationship in abstract terms. We 
may get this value from the equation 

$$r_{12}^2 = \frac{a\,\Sigma(X_1) + b_{12}\,\Sigma(X_1 X_2) - N c_1^2}{\Sigma(X_1^2) - N c_1^2}$$

(c1 being the arithmetic mean of X1). 
Solving for r, and giving it the sign of b12, we have 

$$r_{12} = -.4984.$$
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These values indicate a negative correlation, though not 
a high one, between yield per acre of corn and June 
temperature in Kansas. Let us see if the estimates could be 
improved if based upon the temperature in July instead 
of in June. 
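
The June computation can be retraced from the column sums of Table 128 that are reported later in the chapter. The following is a minimal sketch in Python, not part of the original text; the variable names are illustrative, and Σ(X2) is taken as 3,241.5, the value implied by the stated mean of 73.6705. It applies the two formulas just given for S and r, which refer to an origin at zero.

```python
import math

# Sums over the N = 44 years, as reported later in the chapter.
# sum_x2 = 3241.5 is an assumption: it is the value implied by the mean 73.6705.
N = 44
sum_x1, sum_x2 = 863.9, 3241.5
sum_x1_sq, sum_x2_sq = 18928.17, 239209.57
sum_x1x2 = 63198.42

c1, c2 = sum_x1 / N, sum_x2 / N  # arithmetic means of yield and June temperature

# Least-squares slope and constant term for X1 = a + b12 * X2.
b12 = (sum_x1x2 - N * c1 * c2) / (sum_x2_sq - N * c2 ** 2)
a = c1 - b12 * c2

# Standard error of estimate and coefficient of correlation (origin at zero).
s12_sq = (sum_x1_sq - a * sum_x1 - b12 * sum_x1x2) / N
r12_sq = (a * sum_x1 + b12 * sum_x1x2 - N * c1 ** 2) / (sum_x1_sq - N * c1 ** 2)
s12 = math.sqrt(s12_sq)
r12 = math.copysign(math.sqrt(r12_sq), b12)  # r is given the sign of b12

print(round(a, 2), round(b12, 3), round(s12, 2), round(r12, 4))
```

The results should fall close to the figures quoted above; small differences in the later decimals are to be expected, because the text works with means rounded to four places.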

The values needed in this study may be computed from 
Table 128. Solving for the constants in the equation of 
regression, we secure the equation 

$$X_1 = 166.07 - 1.866 X_3.$$

For the standard error, we have 

$$S_{13} = 4.81$$

and for the coefficient of correlation 

$$r_{13} = -.6948.$$

We have here a closer relation and a better basis for 
estimates than in the case when June temperature was 
considered. 

Repeating the process for yield per acre and August 
temperature, we have 

$$X_1 = 119.45 - 1.288 X_4$$
$$S_{14} = 5.78$$
$$r_{14} = -.5013.$$

August temperature, it is evident, also affects the corn 
yield in Kansas, a low temperature conducing to yield 
above normal. The relationship is not so close as in the 
case of July temperature, but it is still significant. What 
is needed now is some method of combining these three 
factors, in order that an estimate may be based upon a 
knowledge of their influence, in combination, upon the 
yield of corn. The addition or averaging of the temperatures 
in the three months will not do, for July is obviously more 
important than either of the other months. The principle 
of the method by which this may be accomplished is simple. 
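
The three single-month results can be set side by side with a short sketch. This is an illustration only, written in Python with hypothetical names; it uses the variances and mean products tabulated later in the chapter, together with the standard identities b = p/σx², r = p/(σ1 σx) and S = σ1 √(1 − r²), which are equivalent to the formulas used above but are not the route taken in the text.

```python
import math

var_x1 = 44.6878  # variance of corn yield (sigma_1 squared)

# month: (variance of the month's temperature, mean product with corn yield)
months = {
    "June": (9.2385, -10.1263),
    "July": (6.2013, -11.5671),
    "August": (6.7723, -8.7213),
}


def simple_measures(var_x, p):
    """Return (slope, standard error of estimate, r) for X1 on one variable."""
    slope = p / var_x
    r = p / math.sqrt(var_x1 * var_x)
    s = math.sqrt(var_x1 * (1 - r ** 2))
    return slope, s, r


results = {m: simple_measures(v, p) for m, (v, p) in months.items()}
for month, (slope, s, r) in results.items():
    print(f"{month:7s} b = {slope:.3f}  S = {s:.2f}  r = {r:.4f}")
```

July gives the smallest standard error and the largest coefficient in absolute value, which is the sense in which it is said to be the most important of the three months.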
The Estimation of Corn Yield from Three 
Independent Variables 

The estimating or regression equation in the present case 
will be one in which there is a single dependent variable 
(corn yield) and three independent variables. It will be 
of the form 

$$X_1 = a + b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4.$$

If we can determine the values of the four constants, we 
may substitute given values of X2, X3, and X4 in the 
equation and thus get an estimate for X1 in precisely the same 
way as when two variables are dealt with. The method 
of least squares affords the means of solving for the required 
constants. 

The symbols require a word of explanation, as a perfectly 
simple equation is given a rather ponderous appearance 
by all the subscripts employed. The symbol b12, it has been 
explained, represents the coefficient of regression of X1 on X2 
(i.e., the slope of the line describing their relationship, X1 
being dependent) when these two variables alone are 
included in the study. The symbol b12.34 represents the 
coefficient of net regression of X1 on X2. The addition of the 
subscripts 3 and 4 to the right of the period means, simply, 
that the variables X3 and X4 have been included in the 
study and the effects of their variations eliminated, in so 
far as this one constant (b12.34) is concerned. This constant 
measures the weight which must be given to the variable 
X2 in an estimate of X1 based upon the three independent 
variables, X2, X3, and X4. It will not, of course, be the 
same as b12, which indicates the weight given to X2 when 
an estimate of X1 is based upon X2 alone. Similarly the 
constant b13.24, the coefficient of net regression of X1 on X3, 
measures the weight given to X3 when X2 and X4 are also 
included. Each coefficient represents a single, simple 
constant, but the subscripts are necessary in order that the 
precise meaning of this constant may be clear. The subscripts 
to the left of the period are termed primary subscripts, 
those to the right secondary subscripts. 

FORMATION AND SOLUTION OF THE NORMAL EQUATIONS 

The first task¹ is the securing of the normal equations 
required in solving for the constants in the estimating 
equation given above. Following the usual procedure² we 
have: 

$$
\begin{aligned}
\text{I}\quad & \Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4) \\
\text{II}\quad & \Sigma(X_1X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^2) + b_{13.24}\Sigma(X_2X_3) + b_{14.23}\Sigma(X_2X_4) \\
\text{III}\quad & \Sigma(X_1X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2X_3) + b_{13.24}\Sigma(X_3^2) + b_{14.23}\Sigma(X_3X_4) \\
\text{IV}\quad & \Sigma(X_1X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2X_4) + b_{13.24}\Sigma(X_3X_4) + b_{14.23}\Sigma(X_4^2).
\end{aligned}
$$

The given values might be substituted in these simultaneous 
equations and solutions secured directly for the 
four constants. It is possible to reduce the number of 
normal equations by one, however, and thus lessen materially 
the labor of computation. This is done by using 
deviations from the arithmetic mean for each variable 
instead of absolute values, getting rid in this way of the 
constant term a in the original equation. 

¹ The approach to the problem of multiple correlation which is here taken 
follows that of H. R. Tolley and M. J. B. Ezekiel, “A Method of Handling 
Multiple Correlation Problems,” Journal of the American Statistical Association, 
December, 1923, 993-1003. 

² Cf. Appendix A for a discussion of this procedure and of the methods 
employed in simplifying the normal equations. 

If we let A1, A2, A3, etc., represent the arithmetic means 
of the different variables while x1, x2, etc., represent 
deviations from the means, we may replace the absolute 
numbers X1, X2, X3, etc., by their equivalents, x1 + A1, 
x2 + A2, x3 + A3, etc. Making these substitutions in the 
normal equations, certain algebraic simplifications are possible 
which eliminate the first of the normal equations, 
and reduce the others to the following form: 

$$
\begin{aligned}
\frac{\Sigma(x_1x_2)}{N} &= \frac{\Sigma(x_2^2)}{N}\,b_{12.34} + \frac{\Sigma(x_2x_3)}{N}\,b_{13.24} + \frac{\Sigma(x_2x_4)}{N}\,b_{14.23} \\
\frac{\Sigma(x_1x_3)}{N} &= \frac{\Sigma(x_2x_3)}{N}\,b_{12.34} + \frac{\Sigma(x_3^2)}{N}\,b_{13.24} + \frac{\Sigma(x_3x_4)}{N}\,b_{14.23} \\
\frac{\Sigma(x_1x_4)}{N} &= \frac{\Sigma(x_2x_4)}{N}\,b_{12.34} + \frac{\Sigma(x_3x_4)}{N}\,b_{13.24} + \frac{\Sigma(x_4^2)}{N}\,b_{14.23}.
\end{aligned}
$$

All the variables in the above equations refer to deviations 
from the respective arithmetic means. Therefore Σ(x1x2)/N 
is simply the mean product of the variables x1 and x2, 
Σ(x2²)/N is σ2², etc. Representing the various mean products 
by the symbols p12, p13, etc., and inserting the symbols 
for the squares of the standard deviations, we secure, for 
the normal equations: 

$$
\begin{aligned}
p_{12} &= \sigma_2^2\,b_{12.34} + p_{23}\,b_{13.24} + p_{24}\,b_{14.23} \\
p_{13} &= p_{23}\,b_{12.34} + \sigma_3^2\,b_{13.24} + p_{34}\,b_{14.23} \\
p_{14} &= p_{24}\,b_{12.34} + p_{34}\,b_{13.24} + \sigma_4^2\,b_{14.23}.
\end{aligned}
$$

This is the most convenient form for the solution of the 
normal equations. 

From the data, as arranged in Table 128, the following 
values are derived: 


Σ(X1) =   863.9      Σ(X1²) =  18,928.17      Σ(X1X2) =  63,198.42 
Σ(X2) = 3,241.5      Σ(X2²) = 239,209.57      Σ(X1X3) =  67,295.48 
Σ(X3) = 3,453.4      Σ(X3²) = 271,317.92      Σ(X1X4) =  66,550.84 
Σ(X4) = 3,409.1      Σ(X4²) = 264,433.19      Σ(X2X3) = 254,544.98 
                                              Σ(X2X4) = 251,246.54 
                                              Σ(X3X4) = 267,649.61 

c1 = Σ(X1)/N = 19.6341      c1² =   385.4979 
c2 = Σ(X2)/N = 73.6705      c2² = 5,427.3426 
c3 = Σ(X3)/N = 78.4864      c3² = 6,160.1150 
c4 = Σ(X4)/N = 77.4795      c4² = 6,003.0729 

From these values, the quantities necessary for the solution 
of the normal equations may be readily determined. 
These quantities are brought together below: 

σ1² = Σ(X1²)/N − c1² =  18,928.17/44 −   385.4979 = 44.6878 
σ2² = Σ(X2²)/N − c2² = 239,209.57/44 − 5,427.3426 =  9.2385 
σ3² = Σ(X3²)/N − c3² = 271,317.92/44 − 6,160.1150 =  6.2013 
σ4² = Σ(X4²)/N − c4² = 264,433.19/44 − 6,003.0729 =  6.7723 

p12 = Σ(X1X2)/N − c1c2 =  63,198.42/44 − 1,446.45396 = −10.1263 
p13 = Σ(X1X3)/N − c1c3 =  67,295.48/44 − 1,541.0098  = −11.5671 
p14 = Σ(X1X4)/N − c1c4 =  66,550.84/44 − 1,521.2403  =  −8.7213 
p23 = Σ(X2X3)/N − c2c3 = 254,544.98/44 − 5,782.1323  =   2.9808 
p24 = Σ(X2X4)/N − c2c4 = 251,246.54/44 − 5,707.9535  =   2.1951 
p34 = Σ(X3X4)/N − c3c4 = 267,649.61/44 − 6,081.0870  =   1.8586. 
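
The arithmetic behind these variances and mean products is mechanical, and may be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's worksheet: the dictionary names are hypothetical, the means are rounded to four places as in the text, and Σ(X2) is taken as 3,241.5, the value implied by the mean 73.6705.

```python
N = 44

# Index 1 = corn yield; 2, 3, 4 = June, July, August temperature.
# sums[2] = 3241.5 is an assumption, implied by the stated mean 73.6705.
sums = {1: 863.9, 2: 3241.5, 3: 3453.4, 4: 3409.1}
sums_sq = {1: 18928.17, 2: 239209.57, 3: 271317.92, 4: 264433.19}
sums_prod = {
    (1, 2): 63198.42, (1, 3): 67295.48, (1, 4): 66550.84,
    (2, 3): 254544.98, (2, 4): 251246.54, (3, 4): 267649.61,
}

# Means, rounded to four decimal places as in the text.
means = {i: round(s / N, 4) for i, s in sums.items()}

# Variance: mean of squares less the square of the mean.
variances = {i: sums_sq[i] / N - means[i] ** 2 for i in sums}

# Mean product: mean of cross products less the product of the means.
mean_products = {
    (i, j): s / N - means[i] * means[j] for (i, j), s in sums_prod.items()
}

for key, value in mean_products.items():
    print(key, round(value, 4))
```

Only the upper triangle of mean products is stored, since p23 and p32 are the same quantity.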
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Substituting in the normal equations, we have: 

$$
\begin{aligned}
-10.1263 &= 9.2385\,b_{12.34} + 2.9808\,b_{13.24} + 2.1951\,b_{14.23} \\
-11.5671 &= 2.9808\,b_{12.34} + 6.2013\,b_{13.24} + 1.8586\,b_{14.23} \\
-8.7213 &= 2.1951\,b_{12.34} + 1.8586\,b_{13.24} + 6.7723\,b_{14.23}.
\end{aligned}
$$

Solving these simultaneous equations¹ we secure the 
following values for the constants: 

$$b_{12.34} = -0.460 \qquad b_{13.24} = -1.420 \qquad b_{14.23} = -0.749.$$

The required equation is, therefore, 

$$x_1 = -0.460\,x_2 - 1.420\,x_3 - 0.749\,x_4.$$

This is the equation of regression of x1 on x2, x3, and x4. 
Any given values of the three independent variables (June 
temperature, July temperature, and August temperature) 
may be substituted in this equation, and the most probable 
value of the dependent variable (corn yield per acre) 
determined. In the equation as it stands, it should be noted, 
all the variables are expressed as deviations from their 
respective arithmetic means. For practical purposes it is 
advisable to have an equation in terms of the original values. 
In other words, it is desirable to shift the origin from the 
point of averages to the zero point on the original scales. 
This necessitates re-introducing the constant term a. 
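
The three normal equations can be solved by any standard routine. The sketch below, in Python, uses plain Gaussian elimination with partial pivoting; it is offered as an illustration and is not the Doolittle procedure of Appendix A. The function name `solve_linear` is hypothetical; the coefficients are the variances and mean products quoted above.

```python
def solve_linear(a, b):
    """Solve the system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Rows: the normal equations for p12, p13 and p14.
coeffs = [
    [9.2385, 2.9808, 2.1951],
    [2.9808, 6.2013, 1.8586],
    [2.1951, 1.8586, 6.7723],
]
rhs = [-10.1263, -11.5671, -8.7213]

b12_34, b13_24, b14_23 = solve_linear(coeffs, rhs)
print(round(b12_34, 3), round(b13_24, 3), round(b14_23, 3))
```

The coefficient matrix is symmetric, which is what the Doolittle method exploits to shorten hand computation; the general routine here makes no use of that property.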

The value of a may be determined from the equation 

$$A_1 = a + A_2 b_{12.34} + A_3 b_{13.24} + A_4 b_{14.23}$$

where the A's represent the respective arithmetic means.² 
Inserting the proper values, we have³ 

$$19.6341 = a + 73.6705(-0.46005) + 78.4864(-1.41967) + 77.4795(-0.74910).$$

Solving, 

$$a = 222.99.$$

The equation of regression in terms of original values is, 
therefore, 

$$X_1 = 222.99 - 0.460X_2 - 1.420X_3 - 0.749X_4.$$

¹ Any method of solution may be employed. Perhaps the most convenient 
with three or more equations is the Doolittle method. This is explained in 
detail in Appendix A. 

² This equation is derived from the first normal equation, as given above, 

$$\Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4).$$

Replacing the absolute numbers X1, X2, etc., by their equivalents x1 + A1, 
x2 + A2, etc., we secure 

$$\Sigma(x_1) + NA_1 = Na + b_{12.34}[\Sigma(x_2) + NA_2] + b_{13.24}[\Sigma(x_3) + NA_3] + b_{14.23}[\Sigma(x_4) + NA_4].$$

Since Σ(x1) = 0, Σ(x2) = 0, etc., these values disappear. Dividing through by 
N we obtain the equation presented above. 
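
The recovery of the constant term is a single line of arithmetic. As a brief sketch in Python (names are illustrative), using the means and the five-decimal coefficients quoted in the substitution above:

```python
# Arithmetic means A1..A4 (yield, June, July, August).
means = [19.6341, 73.6705, 78.4864, 77.4795]

# Net regression coefficients b12.34, b13.24, b14.23, to five decimals.
b = [-0.46005, -1.41967, -0.74910]

# A1 = a + A2*b12.34 + A3*b13.24 + A4*b14.23, solved for a.
a = means[0] - sum(bi * Ai for bi, Ai in zip(b, means[1:]))
print(round(a, 2))
```

Because each mean is multiplied by a coefficient, rounding the coefficients to three places before this step would shift the constant noticeably, which is why the text carries extra decimals here.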


COMPUTATION OF THE STANDARD ERROR OF ESTIMATE 

Are estimates based upon this equation any more reliable 
than those based upon the equations previously derived, 
each of which referred to a single independent variable? 
To answer this question the value of the standard error 
must be computed. This will be represented in the present 
case by S1.234, the subscripts referring to the single dependent 
variable (X1) and the three independent variables. This 
value may be computed from the formula⁴ 

$$S_{1.234}^2 = \sigma_1^2 - b_{12.34}\,p_{12} - b_{13.24}\,p_{13} - b_{14.23}\,p_{14}.$$

Substituting the proper values, we have 

$$S_{1.234}^2 = 44.6878 - 4.6586 - 16.4215 - 6.5331 = 17.0746$$
$$S_{1.234} = 4.13.$$ ⁵

This is to be interpreted just as the standard error of estimate 
was interpreted in previous cases. The reliability of estimates 
based upon the mean value of X1 is measured by σ1, which 
has a value of 6.68. The reliability of estimates based 
upon the equation of net regression, when yield is considered 
as a function of temperature in June, July, and August, is 
measured by S1.234, which has a value of 4.13. It is clear 
that estimates made from the equation are distinctly more 
reliable than those based upon a knowledge of X1 alone. 
We have by no means accounted for all the factors that 
are responsible for variability in corn yield, but we have 
measured and reduced to precise terms the effect of three 
factors upon the yield of corn per acre in Kansas. 

³ The arbitrary origin is at zero on each of the original scales, hence A1 = c1, 
A2 = c2, etc. To ensure greater accuracy in solving for a, the values of the 
coefficients b12.34, b13.24, etc., are given to a greater number of decimal places than 
in the equation of regression. 

⁴ This formula may be derived as follows: Given an equation of the type 

$$x_1 = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4$$

(in which the variables refer to deviations from the means) each residual may 
be computed from the equation 

$$d = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4 - x_1. \qquad (1)$$

Multiplying throughout by d, and adding, we have 

$$\Sigma(d^2) = b_{12.34}\Sigma(dx_2) + b_{13.24}\Sigma(dx_3) + b_{14.23}\Sigma(dx_4) - \Sigma(dx_1)$$

but it follows from the method of fitting that 

$$\Sigma(dx_2) = 0, \qquad \Sigma(dx_3) = 0, \qquad \Sigma(dx_4) = 0$$

and, therefore, 

$$\Sigma(d^2) = -\Sigma(dx_1). \qquad (2)$$

Multiplying each residual equation (1) by x1 and adding, we have 

$$\Sigma(dx_1) = b_{12.34}\Sigma(x_1x_2) + b_{13.24}\Sigma(x_1x_3) + b_{14.23}\Sigma(x_1x_4) - \Sigma(x_1^2).$$

Substituting the equivalent of Σ(dx1) in equation (2) we secure 

$$\Sigma(d^2) = \Sigma(x_1^2) - b_{12.34}\Sigma(x_1x_2) - b_{13.24}\Sigma(x_1x_3) - b_{14.23}\Sigma(x_1x_4)$$

and, dividing through by N, 

$$\frac{\Sigma(d^2)}{N} = \frac{\Sigma(x_1^2)}{N} - b_{12.34}\frac{\Sigma(x_1x_2)}{N} - b_{13.24}\frac{\Sigma(x_1x_3)}{N} - b_{14.23}\frac{\Sigma(x_1x_4)}{N}.$$

Since the variables refer to deviations from the means, we have 

$$S_{1.234}^2 = \sigma_1^2 - b_{12.34}\,p_{12} - b_{13.24}\,p_{13} - b_{14.23}\,p_{14}.$$

See Appendix A for a general derivation of these relations. 

⁵ For precise work, when the sample is small, allowance should be made in 
computing S for the number of constants in the equation of regression. Since 
there are four constants in the present equation, the 44 observations have but 
40 degrees of freedom to deviate from the computed values. Denoting by S̄ 
the corrected value of the standard error of estimate, and by m the number of 
constants in the equation of regression, Ezekiel gives 

$$\bar{S}^2 = S^2\left(\frac{N - 1}{N - m}\right).$$

Applying this correction to the present measurements, we have 

$$\bar{S}_{1.234}^2 = 17.0746\left(\frac{43}{40}\right) = 18.355$$
$$\bar{S}_{1.234} = 4.28.$$
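
The standard error of estimate and its small-sample correction may be checked with a short Python sketch. The names are illustrative, and the correction factor (N − 1)/(N − m) is an assumption inferred from the figures in the footnote, where 17.0746 × 43/40 gives 18.355.

```python
import math

var_x1 = 44.6878                      # sigma_1 squared
b = [-0.46005, -1.41967, -0.74910]    # b12.34, b13.24, b14.23
p = [-10.1263, -11.5671, -8.7213]     # p12, p13, p14
N, m = 44, 4                          # observations, constants in the equation

# S^2 = sigma_1^2 - sum of (net regression coefficient * mean product).
s_sq = var_x1 - sum(bi * pi for bi, pi in zip(b, p))
s = math.sqrt(s_sq)

# Correction for the number of constants (factor assumed to be (N - 1)/(N - m)).
s_sq_corrected = s_sq * (N - 1) / (N - m)
s_corrected = math.sqrt(s_sq_corrected)

print(round(s_sq, 4), round(s, 2), round(s_sq_corrected, 3), round(s_corrected, 2))
```

The correction raises the standard error only slightly here because 44 observations are large relative to four fitted constants.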
THE COEFFICIENT OF MULTIPLE CORRELATION 

We have need now of our third measure, the abstract 
coefficient of correlation. The value of this coefficient, as 
we have seen, depends upon the relation between S and σ. 
It may be computed in the present instance from the 
formula 

$$R_{1.234}^2 = 1 - \frac{S_{1.234}^2}{\sigma_1^2}.$$

When the relationship between a single dependent variable 
and several independent variables is being studied, this 
measure is termed the coefficient of multiple correlation 
and is represented by the symbol R. The subscript to 
the left of the period relates to the dependent variable, 
while those to the right relate to the independent variables. 
Substituting in this formula the equivalent of S²1.234 we have 

$$R_{1.234}^2 = 1 - \frac{\sigma_1^2 - b_{12.34}\,p_{12} - b_{13.24}\,p_{13} - b_{14.23}\,p_{14}}{\sigma_1^2}$$

which reduces to¹ 

$$R_{1.234}^2 = \frac{b_{12.34}\,p_{12} + b_{13.24}\,p_{13} + b_{14.23}\,p_{14}}{\sigma_1^2}.$$

Inserting the proper values we have 

$$R_{1.234}^2 = \frac{4.6586 + 16.4215 + 6.5331}{44.6878}$$
$$R_{1.234}^2 = .6179$$
$$R_{1.234} = .786.$$

¹ The coefficient of multiple correlation may also be derived from the general 
formula, which refers to an origin at zero on the original scales. This general 
formula is 

$$R_{1.234 \ldots n}^2 = \frac{a\,\Sigma(X_1) + b_{12.34 \ldots n}\,\Sigma(X_1X_2) + b_{13.24 \ldots n}\,\Sigma(X_1X_3) + \cdots - Nc_1^2}{\Sigma(X_1^2) - Nc_1^2}.$$

For the same reason that estimates of S computed from 
samples must be corrected by making allowance for the 
number of constants in the regression equation, correction 
must be made in R. For if the number of constants is 
equal to the number of observations, R will necessarily 
equal 1. Using R̄ to denote the corrected coefficient of 
multiple correlation and m to denote the number of 
constants in the equation of regression, Ezekiel gives 

$$\bar{R}^2 = 1 - \left\{(1 - R^2)\left(\frac{N - 1}{N - m}\right)\right\}.$$

In the present example 

$$\bar{R}^2 = 1 - \left\{(1 - .6179)\left(\frac{43}{40}\right)\right\} = .5892$$
$$\bar{R} = .768.$$

In later references to this illustration the uncorrected 
measure is used, though it is to be understood that the 
corrected measure provides a somewhat closer approximation 
to the true R than does the uncorrected coefficient. 
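
Both forms of the coefficient can be reproduced from the quantities already at hand. The following Python sketch is illustrative only; the names are hypothetical, and the correction factor (N − 1)/(N − m) is an assumption inferred from the printed result .5892.

```python
import math

var_x1 = 44.6878
# The three products b12.34*p12, b13.24*p13, b14.23*p14 as given in the text.
explained = 4.6586 + 16.4215 + 6.5331
N, m = 44, 4

r_sq = explained / var_x1
r_multiple = math.sqrt(r_sq)          # R is reported without sign

# Corrected coefficient (factor assumed to be (N - 1)/(N - m)).
r_sq_corrected = 1 - (1 - r_sq) * (N - 1) / (N - m)
r_corrected = math.sqrt(r_sq_corrected)

print(round(r_sq, 4), round(r_multiple, 3))
print(round(r_sq_corrected, 4), round(r_corrected, 3))
```

The numerator is the part of the variance of X1 that the regression accounts for, so R² may be read as the proportion of variance associated with the three temperature series taken together.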

The coefficient of multiple correlation is an index of the 
degree of relationship between a single dependent variable 
and a number of independent variables, in combination. 
It measures the degree to which variations in the dependent 
variable are related to the combined action of the other 
factors. Its significance may be clearer if all the independent 
variables are looked upon as constituting a single independ- 
ent series. The coefficient is then seen to be a measure 
of the relationship between the dependent variable and the 
independent series, which is precisely what the coefficient 
of correlation is in the simpler case of two variables. In 
the multiple case the independent series has several com- 
ponent elements, but this fact does not alter the essential 
significance of the coefficient. No positive or negative sign 
is attached to R, it should be noted. In the present instance 
all of the independent variables are negatively correlated 
with corn yield, and a negative sign might be attached. 
The correlation could be positive, however, for some of 
the independent variables, and negative for others. Because 
of this fact, R is always given without sign. The signs of 
the constants in the equation of net regression show which 
of the independent variables are positively correlated and 
which are negatively correlated with the dependent variable. 

The sampling error of the coefficient of multiple correlation 
may be estimated from the formula 

$$\sigma_R = \frac{1 - R^2}{\sqrt{N - m}}$$

where m is the number of constants in the equation of 
regression. A more accurate test of the significance of R 
may be applied with reference to Fisher’s z-table, discussed 
in Chapter XV. The deviations of actual from computed 
values serve as a yardstick for testing the variability in 
X1 that is attributable to X2, X3, and X4, as the relationship 
is defined by the equation of regression. In common with 
other correlation problems, this one reduces to a comparison 
of variances. 

The sum of the squares of the deviations of the observed 
values of X1 from the computed values is 751.2824. The 
sum of the squares of the deviations of the computed 
values of X1 from the mean value of X1 is 1,214.9808. 
Since there are 44 observations, and since the equation 
of regression contains four constants, there are 40 degrees 
of freedom in the deviations from the regression function. 
The three coefficients of regression (other than the 
constant a) give three degrees of freedom to variation among the 
computed values of X1. The test takes the following form. 


Nature of variability                  Degrees of    Sum of squared    Mean square    Log_e σ²
                                       freedom       deviations        σ²

Variation among computed values             3         1,214.9808        404.9936       6.0039
Deviation of observed from
  computed values                          40           751.2824         18.7821       2.9329

                                           43         1,966.2632     Difference =      3.0710
                                                                              z =      1.5355

For n1 = 3, n2 = 40, the 1 per cent value of z, as derived 
from Appendix Table VI, is .7308. The present value is 
greatly in excess of this. The variation in X1 attributable 
to the influence of X2, X3, and X4 is clearly greater than 
the residual variability here used as the yardstick. The 
measure of correlation, R, is unquestionably significant. 
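
The z computation in the table above can be sketched as follows. This Python fragment is an illustration with hypothetical names; the residual sum of squares is taken as 751.2824, the figure consistent with the total of 1,966.2632 and the mean square of 18.7821, and the variance ratio printed at the end is added only as the equivalent modern form of the same comparison.

```python
import math

ss_computed, df_computed = 1214.9808, 3   # variation among computed values
ss_residual, df_residual = 751.2824, 40   # deviations of observed from computed

ms_computed = ss_computed / df_computed
ms_residual = ss_residual / df_residual

# Fisher's z: half the difference of the natural logarithms of the mean squares.
z = 0.5 * (math.log(ms_computed) - math.log(ms_residual))

# Equivalent variance ratio (not used in the text).
variance_ratio = ms_computed / ms_residual

z_one_per_cent = 0.7308                   # tabled value quoted in the text
print(round(z, 4), round(variance_ratio, 2), z > z_one_per_cent)
```

Since z is half the logarithm of the variance ratio, a test on z and a test on the ratio itself lead to the same conclusion.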


COMPARISON OF MEASURES OF RELATIONSHIP 

The degree to which our knowledge of the causes of 
variation in corn yield has been improved and the reliability 
of our estimates increased by taking account of the various 
factors in combination may be more readily appreciated 
if we bring together the various measures secured in the 
course of this analysis. 

Table 129 

A Comparison of Certain Measures Pertaining to the Corn 
Yield in Kansas 

                                                Measure of        Coefficient 
Basis of estimate                               reliability       of 
                                                of estimate       correlation 

Arithmetic mean of X1 = 19.63                   σ1 = 6.68 
X1 = 100.35 - 1.096X2                           S12 = 5.80        r12 = -.4984 
X1 = 166.07 - 1.866X3                           S13 = 4.81        r13 = -.6948 
X1 = 119.45 - 1.288X4                           S14 = 5.78        r14 = -.5013 
X1 = 222.99 - 0.460X2 - 1.420X3 - 0.749X4       S1.234 = 4.13     R1.234 = .7861 

The value of S might be further reduced and the value 
of R correspondingly increased by bringing into the analysis 
other factors, such as rainfall during the growing months. 
The method which has been explained may be extended 
to cover any number of variables, one equation being added 
to the set of simultaneous equations for each additional 
variable introduced. 

THE METHOD OF MULTIPLE CORRELATION VALID FOR LINEAR 
RELATIONSHIPS 

One important condition has not been emphasized in the 
course of the preceding discussion. The validity of this 
method of multiple correlation depends upon the existence 
of a linear relationship between each pair of variables. 
Thus with four variables there were six pairings possible 
(i.e., six mean products were computed). If there had been 
a material departure from linearity in any of these six 
relationships the significance of the results would have been 
decreased. There would be no fallacy involved in the use 
of the equation under these conditions, but it would not 
furnish as good a basis for estimates as one which took 
account of the true relationship. In such a case the values 
of S and R would indicate that the estimates based upon 
the assumption of linear relationship were not very reliable.¹ 

AN APPLICATION OF THE METHOD 

Let us illustrate the use of the estimating equation. 
In the year 1933 the average June temperature in Kansas 
was 80.5° F., the average July temperature was 81.4° F., 
and the average August temperature was 76.8° F. What 
was the probable corn yield per acre? Substituting these 
values for X2, X3, and X4 in the equation, 

$$X_1 = 222.99 - 0.460X_2 - 1.420X_3 - 0.749X_4$$

we have 

$$X_1 = 222.99 - (0.460 \times 80.5) - (1.420 \times 81.4) - (0.749 \times 76.8)$$
$$X_1 = 12.85.$$

The estimated yield for 1933 is thus 12.85 bushels per acre. 

¹ An approach to problems of multiple correlation when the relationship 
between the subordinate series is non-linear is explained by M. J. B. Ezekiel in 
the Journal of the American Statistical Association, Vol. XIX, N. S. No. 148, 
1924, and in his book Methods of Correlation Analysis. 

What are the limits within which we may expect the 
actual yield to fall, with respect to this estimate? The value 
of S1.234 is 4.13 bushels. This means that the odds are 
68 out of 100 that the actual yield will be within the 
limits 8.72 bushels (i.e., 12.85 - 4.13) and 16.98 bushels 
(i.e., 12.85 + 4.13). The actual yield in 1933 was 11.5 
bushels per acre. 

In this illustration we have used one of the years 
included in the study. The same method would be employed 
in making an estimate for a future year. (Additional 
elements of uncertainty are introduced, of course, whenever 
results secured for one period are applied to another time 
period.) Thus, from the temperatures in 1936 (76.7° in 
June, 85.5° in July, and 84.4° in August), an estimate 
of 3.1 bushels per acre is yielded by the regression equation 
employed above. This was a summer of exceptional heat 
and drought. The actual yield was 4.0 bushels per acre. 
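
The two estimates discussed above can be reproduced with a small function. This is a sketch in Python with a hypothetical function name, using the rounded coefficients of the regression equation; the July 1936 temperature is taken as 85.5°, the figure consistent with the stated estimate of 3.1 bushels.

```python
def estimate_yield(june, july, august):
    """Estimated corn yield (bushels per acre) from monthly mean temperatures."""
    return 222.99 - 0.460 * june - 1.420 * july - 0.749 * august


S = 4.13  # standard error of estimate, in bushels

# 1933: a year included in the study.
est_1933 = estimate_yield(80.5, 81.4, 76.8)
limits_1933 = (est_1933 - S, est_1933 + S)   # limits of one standard error

# 1936: a year outside the period studied (July value assumed to be 85.5).
est_1936 = estimate_yield(76.7, 85.5, 84.4)

print(round(est_1933, 2), [round(v, 2) for v in limits_1933])
print(round(est_1936, 1))
```

Nothing in the linear equation prevents a negative estimate for a sufficiently hot summer, which is one practical reason for caution when it is applied outside the range of the data.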

The Meaning of Partial or Net Correlation 

In the preceding section we have sought to determine 
the degree to which corn yield in Kansas is affected by the 
temperature in June, July, and August, treating the three 
independent variables in combination. Our aim has been 
to measure their combined effect upon corn yield. There 
is a related problem, which in many studies may be of 
major importance. This is the determination of the 
relationship between a dependent variable and a single 
independent variable when all other factors included in the study are 
held constant. Concretely, what would be the effect upon 
corn yield of variations in July temperature, if June 
temperature and August temperature could be held constant? This 
is the problem of net or partial correlation. 

It is obvious that if a method could be developed by 
which two variables could be isolated for separate study, it 
would add immeasurably to the analytical powers of the 
economist, and of social scientists in general. It would give 
to the student in these fields that power to eliminate 
irrelevant influences and to concentrate his attention upon a single 
factor which is possessed by the chemist, for example. In 
studying the effect of one element upon another the chemist 
seeks to eliminate all other elements, and the effectiveness of 
his analysis depends in large part upon the degree to which it 
is possible thus to isolate the object of immediate interest. 

It is not generally possible in economic analysis to 
eliminate all but one of the factors responsible for variations 
in a given series. The direct and indirect causes of a given 
economic phenomenon are too numerous and too complicated 
in their interaction for the economist ever to hope to emulate 
the chemist in reducing his problem to terms of but two 
variables. But, within certain limits, the statistician is 
able to employ the method of the physical scientist in 
holding constant certain factors while the effects of 
variations in another are studied. The methods which make this 
possible are among the most powerful of the instruments 
which the student of the social sciences possesses. 

The method of partial correlation may be explained with 
reference to the problem of corn yield in Kansas. Our 
object is to determine the net correlation between corn 
yield and the temperature in each of the three months 
for which the average temperature is given. 

DISTINCTION BETWEEN PARTIAL AND SIMPLE CORRELATION 

It is important to distinguish between this problem and 
that faced in the ordinary measurement of relationship 
between two variables. We have already secured, as a 
description of the average relationship between corn yield 
and July temperature, the equation 

$$X_1 = 166.07 - 1.866X_3$$

with 

$$S_{13} = 4.81$$

and 

$$r_{13} = -.6948.$$

These measures describe the relationship in question when 
all other factors are ignored. They are not taken account 
of. They are merely neglected. It is as though the chemist, 
in studying the reaction of one element to another, used 
a test tube containing various impurities, which he made 
no attempt to remove. The economist cannot, in general, 
locate and remove all the “impurities” in his problem, but 
he should recognize that his measures relate to such 
uncorrected data. 

The Method of Partial Correlation 

In seeking to determine the net correlation between corn 
yield and July temperature we attempt to secure a measure 
of the correlation which would prevail if other factors might 
be held constant. We shall take full account of the other 
factors we have studied, but we shall try to secure a 
measure influenced only by fluctuations in July temperature, in 
relation to corn yield. 

One possible method of accomplishing this end may be 
suggested. If we possessed data covering a very long 
period we might be able to pick out a number of years 
during which the average temperatures in June and August 
remained unchanged. Let us say that we could find thirty 
years in all, during each of which the June temperature 
averaged 74° and the August temperature 78°. Corn yield 
and July temperature varied during these years. The 
relationship between July temperature and corn yield might 
now be measured, and it would be certain that the results 
would not be affected by the presence of fluctuations in 
June temperature and August temperature. Unfortunately, 
this method of holding certain factors constant cannot be 
employed. The data are too limited and too varied, in 
general, to enable us to pick from among them such figures 
as are appropriate to our purpose. Other methods of 
arriving at the same end are available, however. 

As a first step, let us derive the equation defining the
relationship between corn yield as dependent variable and
June temperature and August temperature as independent
variables. This will be of the form

$$X_1 = a + b_{12.4}X_2 + b_{14.2}X_4.$$

We solve for the constants exactly as in the preceding
example, except that variables $X_1$, $X_2$, and $X_4$ only are
employed. The desired equation is

$$X_1 = 160.97 - 0.856X_2 - 1.010X_4.$$

We may determine the value of the standard error of
estimate from the relation

$$S^2_{1.24} = \sigma^2_1 - b_{12.4}\,p_{12} - b_{14.2}\,p_{14}.$$

We secure

$$S^2_{1.24} = 27.2112$$

$$S_{1.24} = 5.22.$$

If corn yield per acre is estimated from June temperature
and August temperature the standard error of estimate,
or the standard deviation of the remaining variability, is
5.22 bushels. But we know that if corn yield is estimated
from June, July, and August temperature, the standard
error of estimate, or the standard deviation of the remaining
variability, is 4.13 bushels. The measure of remaining or
“unexplained” variability is reduced from 5.22 to 4.13
by the addition of July temperature ($X_3$) to the estimating
equation, after account has already been taken of the
influence of June temperature ($X_2$) and August temperature
($X_4$). The difference between these two measures may be
taken to represent a relationship between $X_1$ and $X_3$ which
is not affected by variations in $X_2$ and $X_4$.

We have seen that the degree of correlation between a
dependent variable ($X_1$) and an independent variable ($X_3$)
may be defined by the relation

$$r^2_{13} = 1 - \frac{S^2_{1.3}}{\sigma^2_1}.$$

Here the denominator of the fraction in the right-hand
member defines the original variability of $X_1$, while the
numerator of that fraction defines the variability of $X_1$
after account has been taken of the influence of $X_3$. In
the present problem we have

$$r^2_{13} = 1 - \frac{23.1134}{44.6878} = .4828$$

$$r_{13} = -.695.$$

The coefficient of correlation is given the sign of $b_{13}$, the
coefficient of regression.

In exactly the same way, we may say that the net
correlation between $X_1$ and $X_3$, when the relationship is
not affected by fluctuations in $X_2$ and $X_4$, is defined by the
relation

$$r^2_{13.24} = 1 - \frac{S^2_{1.234}}{S^2_{1.24}}.$$

Here the denominator of the fraction in the right-hand
member defines the variability remaining in $X_1$ after account
has been taken of the influence of $X_2$ and $X_4$, while the
numerator defines the variability remaining in $X_1$ after
account has been taken of the influence of $X_2$, $X_3$, and $X_4$.
Numerator and denominator differ only because of the presence
of correlation between $X_1$ and $X_3$ that is incremental to any
correlation that may exist between $X_1$ on the one hand and
$X_2$ and $X_4$ on the other. If the equation

$$X_1 = 222.99 - 0.460X_2 - 1.420X_3 - 0.749X_4$$

gives estimates no more reliable than those derived from
the equation

$$X_1 = 160.97 - 0.856X_2 - 1.010X_4$$

then numerator ($S^2_{1.234}$) and denominator ($S^2_{1.24}$) of the above
fraction will be equal, and $r^2_{13.24}$ will have a value of zero.
But if the equation containing $X_2$, $X_3$, and $X_4$ as independent
variables gives better estimates than does the equation
containing only $X_2$ and $X_4$, the numerator will be smaller
than the denominator, and $r^2_{13.24}$ will have a value other
than zero. If the estimates based upon the three independent
variables are in exact agreement with observed yields,
$S^2_{1.234}$ will be equal to zero, and $r^2_{13.24}$ will have a value of
unity.

Employing the values derived above, we have

$$r^2_{13.24} = 1 - \frac{17.0746}{27.2112} = .3725$$

$$r_{13.24} = -.610.$$

The coefficient of net correlation, $r_{13.24}$, is negative, having
the same sign as the coefficient of net regression, $b_{13.24}$.

The quantity $r_{13.24}$ measures the degree of correlation
between $X_1$ and $X_3$ when neither one is affected by variations
in $X_2$ and $X_4$. It may be thought of, equally well, as a
measure of the degree to which errors in estimating $X_1$
are reduced when use is made of $X_3$, after full account has
already been taken of the influence of $X_2$ and $X_4$ on $X_1$.
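
The arithmetic of this comparison may be sketched in a few lines of Python. The function name `net_correlation` and its arguments are illustrative assumptions, not notation from the text; the figures are those quoted above for the corn yield problem.

```python
import math

def net_correlation(s2_with, s2_without, sign_source):
    """Net correlation from two squared standard errors of estimate.

    s2_with     -- squared standard error when the variable is included
    s2_without  -- squared standard error when the variable is omitted
    sign_source -- coefficient of net regression, which supplies the sign
    """
    r_squared = 1.0 - s2_with / s2_without
    r = math.copysign(math.sqrt(r_squared), sign_source)
    return r, r_squared

# Figures quoted in the text (Kansas corn yield and temperature).
S2_1_234 = 17.0746   # X1 estimated from X2, X3 and X4
S2_1_24 = 27.2112    # X1 estimated from X2 and X4 only
b_13_24 = -1.420     # net regression of X1 on X3

r_13_24, r2_13_24 = net_correlation(S2_1_234, S2_1_24, b_13_24)
print(round(r2_13_24, 4), round(r_13_24, 3))
```

The same function serves for any of the net coefficients, provided the two standard errors differ only by the inclusion of the one variable being paired with the dependent variable.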

The meaning of the symbols employed in the above
demonstration should be clear from the context. As in
the coefficients of net regression, the first of the subscripts
to the left of the point (the primary subscripts) refers to
the dependent variable; the second of the primary subscripts
refers to the single independent variable to which
the measure of net correlation applies specifically. The
subscripts to the right of the point (the secondary subscripts)
indicate the variables which are held constant for
the purpose of the particular comparison being made. The
number held constant is two in the present case, though
it might be one, or any other number. Thus the general
formula for the coefficient of net correlation between variables
$X_1$ and $X_3$ would be

$$r^2_{13.2456 \ldots n} = 1 - \frac{S^2_{1.23456 \ldots n}}{S^2_{1.2456 \ldots n}}.$$

The variable that is present in the numerator and absent
in the denominator is the particular independent variable
that is being paired with the dependent variable for the
purpose of measuring net relationship.

The coefficients of net correlation between $X_1$ and each
of the other independent variables may be derived in similar
fashion. Thus

$$r^2_{12.34} = 1 - \frac{S^2_{1.234}}{S^2_{1.34}}$$

$$r^2_{14.23} = 1 - \frac{S^2_{1.234}}{S^2_{1.23}}.$$

In each case the difference between numerator and denominator
of the fraction in the right-hand member measures
the net reduction in the variability of $X_1$ which is associated
with a relationship between $X_1$ and a single independent
variable, account having already been taken of the influence
of two other variables.

It is clear that such measurements as these are net only
with respect to the variables represented by the secondary
subscripts. The coefficient $r_{12.34}$ measures the degree of
relation between $X_1$ and $X_2$ when $X_3$ and $X_4$ are held
constant. There may be many other factors affecting $X_1$ and
$X_2$; the disturbing influences of such factors have not been
eliminated. These other factors still muddy the water of
analysis. Ignoring them is not the same as holding them
constant. Only by direct measurement and inclusion in
the study, as was done with $X_3$ and $X_4$, may the influence
of additional variables be effectively eliminated.


Another Method of Computing Coefficients of
Partial Correlation

Obviously a whole series of coefficients of net correlation
may be computed in dealing with a number of variables.
In deriving a number of such measurements a method may
be utilized which differs somewhat from that employed
above, and which has certain advantages in the way of
systematic arrangement.

A simple coefficient of correlation relating to but two
variables is termed a coefficient of zero order. Such coefficients
are represented by symbols of the type $r_{12}$, $r_{34}$, etc.
Coefficients of net correlation which relate to two variables,
while a single additional variable is held constant, are
termed coefficients of the first order, and are represented by
symbols such as $r_{12.3}$, $r_{24.3}$, etc. Similarly, we may have
coefficients of the second, third, fourth, or nth order, depending
upon the number of variables held constant while the
relationship between a single dependent and a single independent
variable is being measured.

It is possible to derive each coefficient of partial correlation
from those of the next lower order. Thus a coefficient
of the first order may be derived from the relation

$$r_{12.3} = \frac{r_{12} - r_{13} \cdot r_{23}}{(1 - r^2_{13})^{1/2}\,(1 - r^2_{23})^{1/2}}.$$

For a coefficient of the second order

$$r_{12.34} = \frac{r_{12.3} - r_{14.3} \cdot r_{24.3}}{(1 - r^2_{14.3})^{1/2}\,(1 - r^2_{24.3})^{1/2}}.$$

As a general equation for a coefficient of net correlation
of any order,¹ we have

$$r_{12.345 \ldots n} = \frac{r_{12.345 \ldots (n-1)} - r_{1n.345 \ldots (n-1)} \cdot r_{2n.345 \ldots (n-1)}}{(1 - r^2_{1n.345 \ldots (n-1)})^{1/2}\,(1 - r^2_{2n.345 \ldots (n-1)})^{1/2}}.$$

Thus it is possible, starting with the zero order coefficients
of correlation, to compute all higher order coefficients
successively. The mere arithmetic of calculation would be
laborious, but certain prepared tables reduce these computations
to a minimum.² The method may be illustrated, using
the data of the preceding problem.

¹ It will be noted that in an equation used in computing a coefficient of
partial correlation the three r’s in the numerator of the right-hand member
have the same secondary subscripts, and that these secondary subscripts are
one less in number than the secondary subscripts of the left-hand member;
that the first r in the numerator has the same primary subscripts as the
left-hand member; that the second and third r’s in the numerator have primary
subscripts composed of one of the primary subscripts of the left-hand member
plus the missing secondary subscript; that the two r’s in the denominator are
the same as the second and third r’s in the numerator.
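
The successive derivation of higher order coefficients from those of the next lower order lends itself to a short recursive routine. The following is a minimal sketch, assuming the six zero order coefficients tabulated for the corn yield problem; the function name `partial_r` and the dictionary layout are illustrative choices, not part of the text.

```python
import math

def partial_r(r0, i, j, held=()):
    """Net correlation of variables i and j, holding `held` constant.

    r0 maps frozenset({a, b}) to the zero order coefficient r_ab.
    Each call removes the last held variable k and applies the
    general formula to three coefficients of the next lower order.
    """
    if not held:
        return r0[frozenset((i, j))]
    rest, k = tuple(held[:-1]), held[-1]
    r_ij = partial_r(r0, i, j, rest)
    r_ik = partial_r(r0, i, k, rest)
    r_jk = partial_r(r0, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Zero order coefficients for corn yield (1) and June (2), July (3)
# and August (4) temperature.
zero_order = {
    frozenset((1, 2)): -0.4984,
    frozenset((1, 3)): -0.6948,
    frozenset((1, 4)): -0.5013,
    frozenset((2, 3)): 0.3938,
    frozenset((2, 4)): 0.2775,
    frozenset((3, 4)): 0.2868,
}

r_12_34 = partial_r(zero_order, 1, 2, (3, 4))
r_13_24 = partial_r(zero_order, 1, 3, (2, 4))
r_14_23 = partial_r(zero_order, 1, 4, (2, 3))
print(round(r_12_34, 4), round(r_13_24, 4), round(r_14_23, 4))
```

Because the routine carries full precision at every step, its results may differ in the fourth decimal place from values built up from coefficients rounded at each stage. Reversing the order of the held variables, as in `partial_r(zero_order, 1, 2, (4, 3))`, gives the same result by a different route and so serves as a check.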

In the present case we require three coefficients of the
second order, $r_{12.34}$, $r_{13.24}$, and $r_{14.23}$. These will serve as
measures of the net correlation between corn yield and
temperature in each of the three critical months. The
formula from which the first of these measures may be
computed was given above. For the second, we have

$$r_{13.24} = \frac{r_{13.2} - r_{14.2} \cdot r_{34.2}}{(1 - r^2_{14.2})^{1/2}\,(1 - r^2_{34.2})^{1/2}}$$

and for the third

$$r_{14.23} = \frac{r_{14.2} - r_{13.2} \cdot r_{43.2}}{(1 - r^2_{13.2})^{1/2}\,(1 - r^2_{43.2})^{1/2}}.$$

But each of these values may be derived from a slightly
different grouping of first order coefficients. We may use
the three formulas

$$r_{12.34} = \frac{r_{12.4} - r_{13.4} \cdot r_{23.4}}{(1 - r^2_{13.4})^{1/2}\,(1 - r^2_{23.4})^{1/2}}$$

$$r_{13.24} = \frac{r_{13.4} - r_{12.4} \cdot r_{32.4}}{(1 - r^2_{12.4})^{1/2}\,(1 - r^2_{32.4})^{1/2}}$$

$$r_{14.23} = \frac{r_{14.3} - r_{12.3} \cdot r_{42.3}}{(1 - r^2_{12.3})^{1/2}\,(1 - r^2_{42.3})^{1/2}}.$$

By employing both methods in computing each second
order coefficient a check upon the calculations is afforded.

COMPUTATION OF FIRST ORDER COEFFICIENTS

The second order coefficients cannot be computed until
all necessary first order coefficients have been secured.
The necessary equations, of the type

$$r_{12.3} = \frac{r_{12} - r_{13} \cdot r_{23}}{(1 - r^2_{13})^{1/2}\,(1 - r^2_{23})^{1/2}},$$

may be constructed from the general formula for coefficients
of partial correlation. Since several of these values must
be computed, a systematic arrangement should be employed.

² J. R. Miner, Tables of √(1 − r²) and 1 − r² for Use in Partial Correlation and
in Trigonometry, Johns Hopkins Press, Baltimore, Md., 1922.

Table 130

Illustrating the Computation of First Order Coefficients of Partial
Correlation

(Kansas corn yield and temperature)

   r 0 Order                       Product                              r 1st Order
 Sub-    Coef-      (1 - r²)^½     term of      Whole      Denom-     Sub-     Coef-
 script  ficient                   numerator    numerator  inator     script   ficient

 12      - .4984                   - .2736      - .2248    .6611      12.3     - .3400
 13      - .6948    .7192
 23      + .3938    .9192

 14      - .5013                   - .1993      - .3020    .6890      14.3     - .4383
 13      - .6948    .7192
 43      + .2868    .9580

 24      + .2775                   + .1129      + .1646    .8806      24.3     + .1869
 23      + .3938    .9192
 43      + .2868    .9580

 13      - .6948                   - .1963      - .4985    .7969      13.2     - .6255
 12      - .4984    .8669
 32      + .3938    .9192

 14      - .5013                   - .1383      - .3630    .8329      14.2     - .4358
 12      - .4984    .8670
 42      + .2775    .9607

 34      + .2868                   + .1093      + .1775    .8831      34.2     + .2010
 32      + .3938    .9192
 42      + .2775    .9607

 12      - .4984                   - .1391      - .3593    .8313      12.4     - .4322
 14      - .5013    .8653
 24      + .2775    .9607

 13      - .6948                   - .1438      - .5510    .8290      13.4     - .6647
 14      - .5013    .8653
 34      + .2868    .9580

 23      + .3938                   + .0796      + .3142    .9204      23.4     + .3414
 24      + .2775    .9607
 34      + .2868    .9580

The procedure in computing each first order coefficient
is simple. Three zero order coefficients are necessary for
each calculation. These should be arranged in the table
in the order in which they occur in the numerator of the
fraction from which the required coefficient is to be computed.
The numerator of this fraction is secured by subtracting
from the first zero order coefficient the product of
the other two. This product term appears in one column
of the table. The denominator of the fraction is the product
of two terms of the type $\sqrt{1 - r^2}$, derived from the second
and third coefficients in each group of three. The tabular
arrangement of Table 130 on page 567 permits these computations
to be carried forward systematically.

The coefficient $r_{23.4}$ is, of course, identical with $r_{32.4}$;
$r_{34.2}$ is identical with $r_{43.2}$, etc. It is unnecessary to duplicate
the work of computation with respect to these measures.
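
One line of the worksheet may be reproduced as follows. This is a sketch only: the function name `first_order_row` and the dictionary keys are illustrative assumptions that mirror the column headings of Table 130, and the three inputs are the zero order coefficients of its first group.

```python
import math

def first_order_row(r_first, r_second, r_third):
    """Fill in one group of the first order worksheet.

    The three arguments are zero order coefficients in the order in
    which they occur in the numerator of the formula.
    """
    product = r_second * r_third                 # product term of numerator
    numerator = r_first - product                # whole numerator
    denominator = math.sqrt(1 - r_second ** 2) * math.sqrt(1 - r_third ** 2)
    return {
        "product": product,
        "numerator": numerator,
        "denominator": denominator,
        "coefficient": numerator / denominator,
    }

# First group of Table 130: r12, r13, r23 give r12.3.
row = first_order_row(-0.4984, -0.6948, 0.3938)
for name, value in row.items():
    print(f"{name:12s} {value:+.4f}")
```

Passing first order coefficients in place of zero order ones yields a second order coefficient in the same way, since the formula has the same shape at every order.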

COMPUTATION OF SECOND ORDER COEFFICIENTS

From these first order coefficients the three required
second order coefficients may be secured by methods analogous
to those employed above. The computations are
shown in Table 131. As a check upon the calculations each
required measure is computed from two different combinations
of the first order coefficients.

The value of $r_{13.24}$, it will be noted, is the same as that
derived from the relation between $S_{1.24}$ and $S_{1.234}$.

The meaning of such coefficients as these was explained
in the earlier section dealing with this problem. The following
summary of results reveals the gain in knowledge which
has resulted from the above analysis.


$r_{12} = -.4984$        $r_{12.34} = -.2923$

$r_{13} = -.6948$        $r_{13.24} = -.6101$

$r_{14} = -.5013$        $r_{14.23} = -.4057$

It is clear that the net effect of June temperature upon
corn yield is distinctly less than was indicated by the simple
correlation. This is so because there is a positive correlation
between temperature in June and temperature in July and
August, so that the crude correlation of two variables
alone shows June temperature as more important than it
really is. For the same reason, all the net coefficients are
less than the simple coefficients, though it is still apparent
that July temperature is far more important, in relation
to corn yield, than the temperature in either of the other
months.

Table 131

Illustrating the Computation of Second Order Coefficients of Partial
Correlation

(Kansas corn yield and temperature)

   r 1st Order                     Product                              r 2nd Order
 Sub-    Coef-      (1 - r²)^½     term of      Whole      Denom-     Sub-     Coef-
 script  ficient                   numerator    numerator  inator     script   ficient

 12.3    - .3400                   - .0819      - .2581    .8830      12.34    - .2923
 14.3    - .4383    .8988
 24.3    + .1869    .9824

 13.2    - .6255                   - .0876      - .5379    .8816      13.24    - .6101
 14.2    - .4358    .9000
 34.2    + .2010    .9796

 14.2    - .4358                   - .1257      - .3101    .7643      14.23    - .4057
 13.2    - .6255    .7802
 43.2    + .2010    .9796

 12.4    - .4322                   - .2269      - .2053    .7022      12.34    - .2924
 13.4    - .6647    .7471
 23.4    + .3414    .9399

 13.4    - .6647                   - .1476      - .5171    .8476      13.24    - .6101
 12.4    - .4322    .9018
 32.4    + .3414    .9399

 14.3    - .4383                   - .0635      - .3748    .9238      14.23    - .4057
 12.3    - .3400    .9404
 42.3    + .1869    .9824

The coefficients of net correlation are net, of course,
only with respect to the variables actually taken account
of, and held constant. Thus there may be other factors,
such as rainfall in June, July, or August, which affect
corn yield and which are correlated with the temperature
during these months. Were these included the various
coefficients of net correlation might have different values
from those given.

The sampling error of a coefficient of partial correlation
may be estimated from the same general relations that
hold for zero order coefficients, except that the factor $N - 1$
must be further reduced by the number of variables held
constant. Thus for a second order coefficient such as $r_{12.34}$
the factor becomes $N - 3$.



A MEASURE OF VARIABILITY

Having these coefficients of net correlation, another
measure of some importance may be computed. This is
a measure of the variability of a single character while a
number of related variables are held constant. Thus the
question might arise: If we could hold constant the temperature
in Kansas in June, July, and August, what would
be the variability of the corn yield? In other words: If we
could eliminate such variability in corn yield as is due to
variability in temperature, what fluctuations would remain
in the yield of corn? This measure of variability is represented
by the symbol $\sigma_{1.234 \ldots n}$. It is termed the standard
deviation of order n.

This measure may be computed from the general equation

$$\sigma^2_{1.23 \ldots n} = \sigma^2_1 (1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \ldots (1 - r^2_{1n.23 \ldots (n-1)}).$$

Applying this formula to the results of the study of corn
yield, we have

$$\sigma^2_{1.234} = 44.6878\,[1 - (-.4984)^2]\,[1 - (-.6255)^2]\,[1 - (-.4057)^2] = 17.0797$$

$$\sigma_{1.234} = 4.13.$$

Referring back to the discussion of this problem we find
that the values of $\sigma_{1.234}$ and $S_{1.234}$ are identical. That is,
the standard deviation of variable $X_1$, when variables $X_2$,
$X_3$, and $X_4$ are held constant, is merely the standard deviation
of observed values from computed values of $X_1$. It
is the standard error of estimate, when estimates are based
upon the factors $X_2$, $X_3$, and $X_4$. The reason for this is
obvious. The variability of the original series is reduced
to the extent that estimates based upon the equation of
relationship approximate the actual values. The variability
which remains is due to differences between these estimates
and the actual values. But these differences are merely
the residual deviations, from which $S$ is computed. A
realization of the identity of these two measures may assist
in making their meaning clear.
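
The chain of factors in the general equation may be checked with a short routine. This is a sketch under the figures quoted above; the function name `variance_of_order_n` is an illustrative assumption. Carried at full precision the product comes to about 17.08, which agrees with the printed 17.0797 to the accuracy of the four-place coefficients used.

```python
import math

def variance_of_order_n(variance, net_coefficients):
    """Reduce a variance by the successive factors (1 - r^2)."""
    result = variance
    for r in net_coefficients:
        result *= 1.0 - r * r
    return result

sigma2_1 = 44.6878                    # original variance of corn yield
chain = [-0.4984, -0.6255, -0.4057]   # r12, r13.2, r14.23

sigma2_1_234 = variance_of_order_n(sigma2_1, chain)
sigma_1_234 = math.sqrt(sigma2_1_234)
print(round(sigma2_1_234, 2), round(sigma_1_234, 2))
```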

Since $\sigma_{1.234}$ and $S_{1.234}$ are identical, the coefficient of
multiple correlation, $R_{1.234}$, may be computed from the
equation

$$R^2_{1.234} = 1 - \frac{\sigma^2_{1.234}}{\sigma^2_1}$$

or, using the formula for $\sigma^2_{1.234 \ldots n}$, from the equation

$$1 - R^2_{1.23 \ldots n} = (1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \ldots (1 - r^2_{1n.23 \ldots (n-1)}).$$

BETA COEFFICIENTS

The several coefficients of regression in an equation of
multiple regression are, in effect, weights applied to the
different independent variables in estimating the successive
values of the dependent variable. Usually these coefficients
of regression are not comparable, because the independent
factors are expressed in different units, or because they
differ in variability. It is often desirable to reduce the
coefficients of regression to comparable terms. This may
be done by expressing dependent and independent variables
alike in units of their respective standard deviations. The
coefficients of regression are then called beta coefficients,
and are represented by the symbols $\beta_{12.34}$, etc.

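In terms of a simple two-variable problem, we have

$$x_1 = b_{13}x_3.$$

If we change to standard deviation units we must divide
both sides of the equation by $\sigma_1$, and express $x_3$ in units
of $\sigma_3$. This gives

$$\frac{x_1}{\sigma_1} = b_{13}\,\frac{x_3}{\sigma_1}$$

or

$$\frac{x_1}{\sigma_1} = b_{13}\left(\frac{\sigma_3}{\sigma_1}\right)\frac{x_3}{\sigma_3}.$$

The desired beta coefficient is, then,

$$\beta_{13} = b_{13}\left(\frac{\sigma_3}{\sigma_1}\right).$$

For the corn yield example, we have

$$\beta_{13} = -1.866\left(\frac{\sigma_3}{\sigma_1}\right) = -.695.$$

This may be taken to mean that with an increase of one
standard deviation in $X_3$ (July temperature), the yield of
corn decreased .695 of one standard deviation.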
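
In the two-variable case the beta coefficient is numerically equal to the coefficient of correlation, which is why the value just given agrees with $r_{13}$. The sketch below illustrates this with a small set of hypothetical figures; the data, the function name `sums_of_squares`, and the variable names are assumptions made for illustration and are not the Kansas observations.

```python
import math

def sums_of_squares(x, y):
    """Sums of squared and cross deviations from the two means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

# Hypothetical illustrative figures (not the data of the text).
july_temp = [76.0, 78.5, 80.0, 81.5, 83.0, 79.0]
corn_yield = [28.0, 24.5, 21.0, 19.5, 14.0, 23.0]

n = len(july_temp)
sxx, syy, sxy = sums_of_squares(july_temp, corn_yield)

b_13 = sxy / sxx                      # coefficient of regression
sigma_3 = math.sqrt(sxx / n)          # standard deviation of temperature
sigma_1 = math.sqrt(syy / n)          # standard deviation of yield
beta_13 = b_13 * sigma_3 / sigma_1    # regression in standard units
r_13 = sxy / math.sqrt(sxx * syy)     # coefficient of correlation

print(round(beta_13, 4), round(r_13, 4))
```

With more than one independent variable the equality no longer holds, and the beta coefficients must be derived from the coefficients of net regression.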

These measurements are particularly useful in analyses
involving more than two variables. Here the relationships
between the beta coefficients and the coefficients of net
regression are similar to those indicated for the two-variable
problem. Thus

$$\beta_{12.34} = b_{12.34}\left(\frac{\sigma_2}{\sigma_1}\right)$$

$$\beta_{13.24} = b_{13.24}\left(\frac{\sigma_3}{\sigma_1}\right)$$

$$\beta_{14.23} = b_{14.23}\left(\frac{\sigma_4}{\sigma_1}\right).$$

Substituting the required values in these equations, we have

$$\beta_{12.34} = -.209$$

$$\beta_{13.24} = -.529$$

$$\beta_{14.23} = -.292.$$

The second of these coefficients may be taken to mean that
with an increase of one standard deviation in July temperature,
when June and August temperatures are held constant,
corn yield decreases .529 of one standard deviation. The
other coefficients have similar meanings.

The beta coefficients relate to factors expressed in comparable
units and similar in respect of variability. A
fluctuation of one standard deviation in $X_2$ may be taken
to be equal to a fluctuation of one standard deviation in $X_3$.
The coefficients defining the changes in $X_1$ that are likely
to accompany these equal movements in $X_2$ and $X_3$ have
obvious significance.

CERTAIN LIMITATIONS 

The measures we have described in dealing with problems
of multiple and partial correlation are valid on the assumption
that the relationships among the different variables
are in all cases linear. Thus with four variables six different
pairs may be obtained. The regression in each of these
six cases should be linear if combined or net effects are to
be studied by the methods outlined above. If the regression
is non-linear when natural numbers are dealt with, it may
be possible to secure linear relationships by correlating
logarithms or reciprocals. Thus we might derive an estimating
equation of the type

$$\log X_1 = a + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4$$

if the relation between $X_1$ in logarithmic form and each
of the other variables in the original arithmetic form were
linear. The corresponding measures, $S$ and $R$, would then
relate to ratios, as in the examples given in the following
chapter.

One other important limitation should be noted. Coefficients
of multiple or of net correlation based upon a large
number of variables have little significance unless the number
of observations be large. Misleadingly high values will
be secured when studies involving many variables are based
upon small samples. (Application of the corrections referred
to in the text will prevent misinterpretation, in such cases.)
Within the limits set by these restrictions, the methods of
multiple and partial correlation constitute very powerful
instruments of economic analysis.

CHAPTER XVII


THE MEASUREMENT OF RELATIONSHIP AND 
THE PROBLEM OF ESTIMATION 

It is no great exaggeration to say that quantitative 
method in economics and business centers about the prob- 
lem of estimation. Equations of regression, measures of 
standard error and coefficients of correlation are of interest 
largely because of their bearing upon the practical problems 
of determining probable production, probable price, probable 
business changes. It should not be understood from this 
that the problem of estimation relates only to attempts to 
forecast future changes. We make an estimate whenever 
we seek to determine the most probable value from a 
number of different observations, or whenever we employ 
an equation which describes the relation between two or 
more variables. The value of statistical technique rests 
in large part upon its practical utility in the making of 
estimates. 

This object has been definitely to the fore in the preceding
chapters, which dealt with methods by which the value 
of one variable might be estimated from a given value 
of another. We may, at this point, briefly summarize 
certain assumptions upon which the validity of this method 
rests. 

Some Assumptions Involved in the Making of
Estimates

In earlier chapters it has been pointed out that the most 
probable value of a series of observations is their arithmetic 
mean. Given a normal distribution about the mean, the 
standard deviation affords an exact measure of the
probabilities involved in basing estimates upon the mean.
Similarly, the standard error of estimate affords an exact 
measure of the probabilities involved in basing estimates 
upon an equation of regression, again upon the assumption 
that the distribution about the line of regression is normal. 
The significance and usefulness of the equation of regression 
may be determined by comparing the standard error of 
estimate of a given variable with the standard deviation. 

From the relation between these two values, moreover, 
an abstract measure of relationship, the coefficient or index 
of correlation, may be computed. This coefficient, or index, 
is a thoroughly valid and accurate measure only if the 
distribution about the line of regression and the distribution 
about the mean are normal, or approximately so. Pro- 
nounced departures from the normal type lessen the signifi- 
cance of these measures. 

In the foregoing discussion we have been concerned with 
arithmetic values throughout. In speaking of estimates
based upon the mean we referred to the arithmetic mean. 
The distributions about the mean and about the line of 
regression are assumed to be normal when deviations are 
measured arithmetically. The standard deviation and the 
standard error of estimate are in arithmetic terms, referring 
to absolute values. But may we assume that all the 
distributions we deal with in economic analysis are of the 
arithmetic type? Should estimates be made and errors 
of estimate measured only in arithmetic terms? If they 
should not be so limited, are the methods developed above 
capable of adaptation to other distributions? These ques- 
tions may best be answered in terms of a specific problem. 

A Problem of Estimation: Logarithmic and Ratio
Values

In Table 132 the production and price of oats in the
United States from 1881 to 1913 are recorded. Appropriate
lines of trend were fitted to these series and the ratios
of the actual values of the items in each series to the trend
values determined.

Table 132

Production and Price of Oats in the United States

         Production    Straight     Ratio of      Price of    Straight    Ratio of
         of oats in    line trend   actual pro-   oats in     line trend  actual
 Year    U.S.          of produc-   duction to    Chicago     of price ²  price to
         (millions     tion ¹       trend value   (cents                  trend value
         of bu.)                                  per bu.)

 1881       416           448          .929          47          36.0        1.30
 1882       488           471         1.036          37          35.3        1.05
 1883       571           494         1.156          31          34.6         .90
 1884       583           517         1.128          29          34.0         .85
 1885       629           540         1.165          28          33.2         .84
 1886       624           563         1.108          25          32.5         .77
 1887       659           586         1.124          30          31.2         .96
 1888       701           609         1.151          24          30.5         .79
 1889       751           632         1.188          24          29.8         .81
 1890       523           655          .798          43          29.0        1.48
 1891       738           678         1.088          31          28.3        1.10
 1892       661           701          .943          30          27.5        1.09
 1893       639           724          .882          31          26.8        1.16
 1894       662           747          .886          28          26.1        1.07
 1895       824           770         1.070          19          25.3         .75
 1896       780           793          .983          18          23.6         .76
 1897       791           816          .969          24          25.0         .96
 1898       843           839         1.005          25          26.4         .95
 1899       926           862         1.074          23          27.8         .83
 1900       914           885         1.033          25          29.2         .86
 1901       778           908          .857          42          30.6        1.37
 1902     1,053           931         1.131          33          32.0        1.03
 1903       869           954          .911          38          33.4        1.14
 1904     1,009           977         1.033          30          34.8         .86
 1905     1,090         1,000         1.090          31          36.2         .86
 1906     1,036         1,023         1.013          39          37.6        1.04
 1907       805         1,046          .770          51          39.0        1.31
 1908       851         1,069          .796          52          40.4        1.29
 1909     1,068         1,092          .978          43          41.8        1.03
 1910     1,186         1,115         1.064          35          43.2         .81
 1911       922         1,138          .810          51          44.6        1.14
 1912     1,418         1,161         1.221          37          46.0         .80
 1913     1,122         1,184          .948          41          47.4         .87

¹ This line of trend was fitted to data covering a longer period than that
included in the present study.

² The entire period has been broken into two parts, 1881 to 1896 and 1896
to 1913. A straight line of trend was fitted by H. B. Killough to the data of
each period.

It is desired to measure the relation between these two
variables. A hyperbolic curve of the general type $Y = aX^b$
appears to be an appropriate form to employ in describing
such a relationship. To fit this curve by the method of
least squares, the equation must be reduced to the logarithmic
form

$$\log Y = \log a + b \log X.$$

The normal equations required in fitting a curve of this
type are

$$\text{I} \qquad \Sigma(\log Y) = N \log a + b\,\Sigma(\log X)$$

$$\text{II} \qquad \Sigma(\log X \cdot \log Y) = \log a\,\Sigma(\log X) + b\,\Sigma(\log^2 X).$$

The values necessary for the solution of these equations
are determined from Table 133.¹

From this table we have

$$N = 33$$

$$\Sigma(\log Y) = -.32849 \qquad \Sigma(\log X \cdot \log Y) = -.1143005$$

$$\Sigma(\log X) = .037535 \qquad \Sigma(\log^2 X) = .096423.$$

Substituting in the normal equations, we secure

$$-.32849 = 33 \log a + .037535\,b$$

$$-.1143005 = .037535 \log a + .096423\,b.$$

Solving

$$\log a = -.00861$$

$$b = -1.18206.$$

The required equation is

$$\log Y = (9.99139 - 10) - 1.18206 \log X$$

or

$$Y = .9804X^{-1.18206}.$$


This is the equation which describes the average relationship
between the production and the price of oats
(when the actual figures for each are expressed as ratios
to the respective lines of trend). The corresponding curve
is plotted in Fig. 88 on page 592.

¹ I am indebted to Prof. H. B. Killough of Brown University for permission
to use the data presented in Tables 132 and 133. The figures are taken from his
comprehensive study of the factors affecting oat prices.
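
The solution of the two normal equations may be verified numerically. The sketch below solves the pair by determinants, using the five sums quoted from Table 133; the function names `solve_normal_equations` and `estimate_price_ratio` are illustrative assumptions.

```python
def solve_normal_equations(n, sum_lx, sum_ly, sum_lx2, sum_lxly):
    """Solve normal equations I and II for log a and b.

    I:  sum_ly   = n * log_a      + b * sum_lx
    II: sum_lxly = sum_lx * log_a + b * sum_lx2
    """
    det = n * sum_lx2 - sum_lx ** 2
    log_a = (sum_ly * sum_lx2 - sum_lx * sum_lxly) / det
    b = (n * sum_lxly - sum_lx * sum_ly) / det
    return log_a, b

# Sums taken from the text (common logarithms of the ratios).
log_a, b = solve_normal_equations(
    n=33,
    sum_lx=0.037535,
    sum_ly=-0.32849,
    sum_lx2=0.096423,
    sum_lxly=-0.1143005,
)
a = 10 ** log_a

def estimate_price_ratio(production_ratio):
    """Estimated ratio of price to trend from the fitted curve Y = a X^b."""
    return a * production_ratio ** b

print(round(log_a, 5), round(b, 5), round(a, 4))
print(round(estimate_price_ratio(1.10), 3))
```

The last line estimates the price ratio for a crop ten per cent above trend. Since $b$ is negative and somewhat greater than unity in absolute value, the estimated price falls below trend by rather more than ten per cent.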


Table 133

Computation of Values Required in Fitting a Curve to Data of Oat
Production and Prices

Example I

 (1)     (2)        (3)          (4)              (5)              (6)          (7)           (8)
 Year    Ratio      Ratio of
         of price   production
         to trend   to trend
         Y          X            log Y            log X            log² Y       log² X        log Y · log X

 1881    1.30        .929        .1139434         .9680157 - 1     .01298310    .001022995    - .0036444
 1882    1.05       1.036        .0211893         .0153598         .00044899    .000235923    + .0003255
 1883     .90       1.156        .9542425 - 1     .0629578         .00209375    .003963685    - .0028808
 1884     .85       1.128        .9294189 - 1     .0523091         .00498169    .002736242    - .0036920
 1885     .84       1.165        .9242793 - 1     .0663259         .00573362    .004399125    - .0050222
 1886     .77       1.108        .8864907 - 1     .0445398         .01288436    .001983794    - .0050557
 1887     .96       1.124        .9822712 - 1     .0507663         .00031431    .002577217    - .0009000
 1888     .79       1.151        .8976271 - 1     .0610753         .01048021    .003730192    - .0062524
 1889     .81       1.188        .9084850 - 1     .0748164         .00837500    .005597494    - .0068468
 1890    1.48        .798        .1702617         .9020029 - 1     .02898905    .009603432    - .0166852
 1891    1.10       1.088        .0413927         .0366289         .00171336    .001341676    + .0015162
 1892    1.09        .943        .0374265         .9745117 - 1     .00140074    .000649653    - .0009539
 1893    1.16        .882        .0644580         .9454686 - 1     .00415483    .002973674    - .0035150
 1894    1.07        .886        .0293838         .9474337 - 1     .00086341    .002763216    - .0015446
 1895     .75       1.070        .8750613 - 1     .0293838         .01560968    .000863408    - .0036712
 1896     .76        .983        .8808136 - 1     .9925535 - 1     .01420540    .000055450    + .0008875
 1897     .96        .969        .9822712 - 1     .9863238 - 1     .00031431    .000187038    + .0002425
 1898     .95       1.005        .9777236 - 1     .0021661         .00049624    .000004692    - .0000483
 1899     .83       1.074        .9190781 - 1     .0310043         .00654835    .000961267    - .0025089
 1900     .86       1.033        .9344985 - 1     .0141003         .00429045    .000198818    - .0009236
 1901    1.37        .857        .1367206         .9329808 - 1     .01726316    .004491573    - .0091629
 1902    1.03       1.131        .0128372         .0534626         .00016479    .002858250    + .0006863
 1903    1.14        .911        .0569049         .9595184 - 1     .00323817    .001638760    - .0023036
 1904     .86       1.033        .9344985 - 1     .0141003         .00429045    .000198818    - .0009236
 1905     .86       1.090        .9344985 - 1     .0374265         .00429045    .001400743    - .0024515
 1906    1.04       1.013        .0170333         .0056094         .00029013    .000031465    + .0000955
 1907    1.31        .770        .1172713         .8864907 - 1     .01375256    .012884361    - .0133113
 1908    1.29        .796        .1105897         .9009131 - 1     .01223008    .009818214    - .0109580
 1909    1.03        .978        .0128372         .9903389 - 1     .00016479    .000093337    - .0001240
 1910     .81       1.064        .9084850 - 1     .0269416         .00837500    .000725850    - .0024656
 1911    1.14        .810        .0569049         .9084850 - 1     .00323817    .008374812    - .0052076
 1912     .80       1.221        .9030900 - 1     .0867157         .00939155    .007519613    - .0084036
 1913     .87        .948        .9395193 - 1     .9768083 - 1     .00365792    .000537855    + .0014027

 Total  32.88      33.338      17.6715068 - 18  14.0375350 - 14    .21721807    .096422642    - .1194567
                                                                                              + .0051562
                                                                                              - .1143005

THE STANDARD ERROR OF ESTIMATE IN LOGARITHMIC TERMS

How reliable is this equation? With what degree of
confidence may estimates be based upon it? To answer
these questions we must compute the standard error, S.
Since the fitting process was carried through in terms of
logarithms, the standard error may be computed in the
same terms. Following the procedure explained in earlier
sections with reference to the straight line and the potential
series, we may derive the following equation relating to
the logarithmic curve just fitted:

$$S^2_{\log y} = \frac{\Sigma(\log^2 Y) - \log a\,\Sigma(\log Y) - b\,\Sigma(\log X \cdot \log Y)}{N}.$$

Substituting the proper values, we have

$$S^2_{\log y} = \frac{.21721807 - (-.00861 \times -.32849) - (-1.18206 \times -.1143005)}{33} = \frac{.07927928}{33}$$

$$S^2_{\log y} = .0024024$$

$$S_{\log y} = .04901.$$

The standard error of estimate, in the form of a logarithm,
is .04901. As long as we deal with logarithms,
this is to be interpreted precisely as is the standard error
with respect to other curves. Assuming a normal distribution
of logarithms about the curve which describes the average
relationship, the chances are 68 out of 100 that the logarithm
of a given estimate will not differ from the logarithm of the
actual value by more than .04901, 95 out of 100 that the
logarithm of the given estimate will not differ from the
logarithm of the actual value by more than .09802, and
99.7 out of 100 that the logarithm of the given estimate
will not differ from the logarithm of the actual value by
more than .14703.
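
The arithmetic of this computation may be retraced with a short sketch in Python. The sums and constants are those quoted in the text for the oats example; the function and variable names are illustrative only and form no part of the original method.

```python
import math

# Values quoted in the text for the oats example (N = 33 years).
N = 33
SUM_LOG_Y_SQ = 0.21721807   # sum of (log Y)^2
SUM_LOG_Y = -0.32849        # sum of log Y
SUM_LOG_XY = -0.1143005     # sum of log X * log Y
LOG_A = -0.00861            # fitted constant, i.e. 9.99139 - 10
B = -1.18206                # fitted slope


def standard_error_log(sum_sq, sum_y, sum_xy, log_a, b, n):
    """Return (S^2, S) for a straight line fitted to logarithms."""
    s_squared = (sum_sq - log_a * sum_y - b * sum_xy) / n
    return s_squared, math.sqrt(s_squared)


s_squared, s_log = standard_error_log(
    SUM_LOG_Y_SQ, SUM_LOG_Y, SUM_LOG_XY, LOG_A, B, N
)
print(f"S^2 = {s_squared:.7f}, S = {s_log:.5f}")

# Limits of S, 2S and 3S, all still expressed as logarithms.
for k in (1, 2, 3):
    print(f"{k}S in logarithms: {k * s_log:.5f}")
```

The multiples printed at the end remain logarithms; they are not yet distances in natural numbers.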

INTERPRETATION OF THE STANDARD ERROR OF ESTIMATE;
ZONES OF ESTIMATE

What does this mean in terms of actual values? It means,
simply, that we are dealing throughout in terms of ratios
instead of absolute figures. The difference between the
logarithms of two numbers is the logarithm of the ratio
of one of the original numbers to the other. Thus the
absolute value of S in a given case will depend upon the
magnitude of the values with which we are dealing. If
the user desires to reduce S to absolute values, it must be
done always with reference to a given estimate. That is,
a given value of X is substituted in the equation of average
relationship and the corresponding value of Y estimated.
If the logarithmic equation is used, this estimate will be
in the form of a logarithm. To the logarithm of the estimate
add the value of $S_{\log y}$. The anti-logarithm of the number
thus secured will give the upper limit of a zone extending
a distance equal to S above the line of regression. From
the logarithm of the estimate subtract the value of $S_{\log y}$.
The anti-logarithm of the number thus secured will give
the lower limit of a zone extending a distance equal to
S below the line of regression. The odds are 68 out of 100
that the value of Y in the given case will fall within the
limits thus marked out. The absolute limits corresponding
to 2S and 3S may be similarly determined.

The zone thus marked out with respect to a logarithmic
curve will differ materially from the similar zones already
described in dealing with simple linear equations. In the
simple case a zone extending 1S on each side of the estimating
curve has the same absolute width throughout its length,
and is centered always at the line of regression. The logarithmic
zone, when measured in natural numbers, is of
varying width, and, moreover, is not of the same width
on each side of the plotted curve. It is true, however, that
the ratios on the two sides of the curve are always equal.
That is, the ratio of a value 1S less than the computed
value to the computed value is the same as the ratio of
the latter to a value 1S greater. And when the curves
are plotted on paper ruled logarithmically, the zone included
within a distance 1S on each side of the plotted curve
takes the symmetrical form found in the earlier and simpler
cases. A person accustomed to thinking in terms of ratios
and to the use of logarithmic paper can readily interpret
this measure.
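
The unequal widths and equal ratios of such a zone may be seen in a small Python sketch. The standard error .04901 is the value found above; the two logarithmic estimates (-.2 and +.2) are hypothetical figures chosen only to contrast a small estimate with a large one, and the function name is an assumption of the sketch.

```python
S_LOG = 0.04901  # standard error of estimate in logarithms, from the text


def zone(log_estimate, s_log=S_LOG, k=1):
    """Return (lower, estimate, upper) in natural numbers for a zone of k*S."""
    lower = 10 ** (log_estimate - k * s_log)
    estimate = 10 ** log_estimate
    upper = 10 ** (log_estimate + k * s_log)
    return lower, estimate, upper


# Two hypothetical estimates, one small and one large.
for log_est in (-0.2, 0.2):
    lower, est, upper = zone(log_est)
    print(
        f"estimate {est:.3f}: "
        f"width below {est - lower:.3f}, width above {upper - est:.3f}, "
        f"ratios {est / lower:.4f} and {upper / est:.4f}"
    )
```

In each case the width above the estimate exceeds the width below it, and both widths grow with the estimate, while the two ratios remain equal.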

THE STANDARD ERROR OF ESTIMATE IN TERMS OF RATIOS

Since the ratios are equal throughout, the standard error
of estimate may be expressed in ratio terms. In the present
example we have

$$S_r = \text{anti-log } S_{\log y} = \text{anti-log } .04901 = 1.12$$

where $S_r$ is used to represent the standard error of estimate
in terms of ratios. $S_{\log y}$, as derived above, is positive,
hence the ratio exceeds unity. It is the ratio of the larger
number to the smaller. What does it mean? It means
that in 68 cases out of 100 the actual value, if it exceed
the estimate, will not exceed it by more than 12 per cent,
and, if it fall below the estimate, will stay within a
limit such that the estimate will not be more than 12 per
cent greater than the actual value. This is not a convenient
form, since this ratio always expresses the larger value
in terms of the smaller value. It would be more convenient
to have it always in terms of a percentage of the estimate.
This may be done by putting $S_{\log y}$ in negative terms,
and getting the corresponding natural value. The value
-.04901 = 9.95099 - 10, which is the logarithm of .8933.
In this form the ratio is based upon the relation of the
smaller to the larger number. To make $S_r$ readily intelligible
we may combine the two, writing

$$S_r = .89 \text{ to } 1.12.$$

Interpreting this, it means that, given a normal distribution, 
in 68 cases out of 100 the actual value will not be less than 
89 per cent of the estimate, or more than 112 per cent of 
the estimate. This has a simple, definite meaning more 
significant for most practical purposes than a similar 
measure in terms of absolute values.¹
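
The conversion of the logarithmic standard error to ratio form may be sketched in Python as follows. The value .04901 is taken from the text; the function name `ratio_limits` is hypothetical.

```python
S_LOG = 0.04901  # standard error of estimate in logarithms, from the text


def ratio_limits(s_log, k=1):
    """Anti-logarithms of -k*S and +k*S, i.e. limits as ratios to the estimate."""
    return 10 ** (-k * s_log), 10 ** (k * s_log)


low, high = ratio_limits(S_LOG)
print(f"S_r = {low:.2f} to {high:.2f}")
```

The two limits are reciprocals of one another, which is why a single logarithm suffices to define both.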

¹ The significance of a measure of reliability in percentage form was pointed
out by D. H. Davenport in 1922, in an unpublished article, and such a measure
has been employed in several studies. There has not been available, however,
a ready method of computing this measure, and its possibilities have not,
therefore, been fully realized.

To find the values of 2S or 3S these percentage figures
may not be simply multiplied by 2 or 3. The value of
$S_{\log y}$ must be so multiplied, and the resulting values reduced
to natural numbers. For convenience in use, the anti-logarithms
of both the positive and negative values should
be secured, as in the preceding case. The computations
are simple.

$$2S_{\log y} = .09802.$$

The anti-logarithm of this value, when considered positive,
is 1.25; when negative, .80.

$$3S_{\log y} = .14703.$$

The corresponding anti-logarithms are 1.40 and .71. Summarizing
for the standard error, we have

$$S_r = .89 \text{ to } 1.12$$
$$2S_r = .80 \text{ to } 1.25$$
$$3S_r = .71 \text{ to } 1.40.$$

The values given for $S_r$ indicate the probable percentage
limits within which actual value and estimated value should
fall in 68 out of 100 cases. The values given for $2S_r$ indicate
the probable percentage limits in 95 out of 100 cases.
The values of $3S_r$ indicate the probable percentage limits
in 99.7 cases out of 100, always on the assumption of a
normal distribution of the logarithms of the actual values
about the fitted curve.

APPLICATION OF THE STANDARD ERROR OF ESTIMATE

We may illustrate the use of $S_{\log y}$. Given a production
of oats 50 per cent above the trend value (i.e., the ratio to
trend is 1.50), what is the most probable accompanying
price ratio and what is the degree of accuracy of this
estimate?

The estimating equation is

$$\log Y = (9.99139 - 10) - 1.18206 \log X.$$

Substituting in this equation the value .176091 (the logarithm
of 1.50) we secure for log Y the value 9.78324 - 10.
The corresponding natural number is .607. This means
that if production is 150 per cent of normal (as measured
by the given line of trend) price will probably be 60.7
per cent of normal (as measured by the line of trend).

To determine the reliability of this estimate, the standard
error must be secured. Employing the values of $S_r$ already
computed we find that 54 is 89 per cent of 60.7, while
68 is 112 per cent of 60.7. We interpret these figures to
mean that in 68 cases out of 100 the actual price prevailing
under the given production conditions will not be less than
54 per cent of the normal or trend value nor more than
68 per cent of normal.¹ Corresponding values for $2S_r$
and $3S_r$ may be determined in the manner outlined above.
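
The estimate and its limits may be reproduced with the following Python sketch, which also carries the computation through to 2S and 3S. The constants are those of the fitted equation and standard error given in the text; the function names are illustrative assumptions.

```python
import math

LOG_A = -0.00861   # 9.99139 - 10, constant of the fitted equation
B = -1.18206       # slope of the fitted equation
S_LOG = 0.04901    # standard error of estimate in logarithms


def estimate_price_ratio(x):
    """Price ratio estimated from a production ratio x by the logarithmic curve."""
    return 10 ** (LOG_A + B * math.log10(x))


def limits(estimate, k=1, s_log=S_LOG):
    """Lower and upper limits of the k*S zone, in natural numbers."""
    return estimate * 10 ** (-k * s_log), estimate * 10 ** (k * s_log)


y_hat = estimate_price_ratio(1.50)
print(f"estimated price ratio: {y_hat:.3f}")
for k in (1, 2, 3):
    lo, hi = limits(y_hat, k)
    print(f"{k}S: {lo:.3f} to {hi:.3f}")
```

Note that the multiple k is applied to the logarithm before the anti-logarithm is taken; the ratio limits themselves are never multiplied.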

¹ A question arises at once as to the adequacy of the given lines of trend,
in the present problem. This question is discussed in greater detail in another
section.

THE INDEX OF CORRELATION BASED ON LOGARITHMIC VALUES

We have still to compute the third measure, the abstract
index of correlation.² For an equation of the type

$$\log Y = \log a + b \log X$$

the formula for $\rho$ reduces to

$$\rho^2_{\log y \cdot \log x} = \frac{\log a\,\Sigma(\log Y) + b\,\Sigma(\log X \cdot \log Y) - N c^2_{\log y}}{\Sigma(\log^2 Y) - N c^2_{\log y}}$$

where $c_{\log y}$ represents the difference between the arithmetic
mean of the logarithms of the Y-values and the origin
(in this case, zero on the logarithmic scale). Substituting
the proper values, we have

$$\rho^2_{\log y \cdot \log x} = \frac{(-.00861 \times -.32849) + (-1.18206 \times -.1143005) - (33 \times .00009909)}{.21721807 - (33 \times .00009909)} = \frac{.13466882}{.2139481} = .629445$$

$$\rho_{\log y \cdot \log x} = .793.$$

² The symbol $\rho$ is used for this measure of correlation, instead of r, even
though the relationship in logarithmic form is linear. This is done because such
a measure, in terms of logarithms, cannot be interpreted in precisely the same
way as the ordinary coefficient of correlation.

The index of correlation has a value of .793. How is
this to be interpreted when we are dealing with logarithms
as in the present case?

Its significance may be clearer if viewed in terms of the
relationship

$$\rho^2_{\log y \cdot \log x} = 1 - \frac{S^2_{\log y}}{\sigma^2_{\log y}}.$$

In the present case these values are

$$S_{\log y} = .04901$$
$$\sigma_{\log y} = .08052.$$

When these values are squared and inserted in the above
formula, we have

$$\rho^2_{\log y \cdot \log x} = 1 - \frac{.002402}{.006483}$$

and

$$\rho_{\log y \cdot \log x} = .793.$$
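
That the two forms of the index agree may be checked with a brief Python sketch. The sums and constants are those quoted in the text; the variable names are hypothetical, and the mean of the logarithms is recomputed here rather than taken from the rounded figure .00009909.

```python
import math

N = 33
SUM_LOG_Y_SQ = 0.21721807   # sum of (log Y)^2
SUM_LOG_Y = -0.32849        # sum of log Y
SUM_LOG_XY = -0.1143005     # sum of log X * log Y
LOG_A = -0.00861
B = -1.18206

mean_log_y = SUM_LOG_Y / N           # c, the mean of the logarithms of Y
correction = N * mean_log_y ** 2     # N * c^2

# Direct formula for the squared index of correlation.
rho_sq_direct = (LOG_A * SUM_LOG_Y + B * SUM_LOG_XY - correction) / (
    SUM_LOG_Y_SQ - correction
)

# Equivalent form, 1 - S^2 / sigma^2.
s_sq = (SUM_LOG_Y_SQ - LOG_A * SUM_LOG_Y - B * SUM_LOG_XY) / N
sigma_sq = (SUM_LOG_Y_SQ - correction) / N
rho_sq_ratio = 1 - s_sq / sigma_sq

print(f"sigma = {math.sqrt(sigma_sq):.5f}")
print(f"rho (direct) = {math.sqrt(rho_sq_direct):.3f}")
print(f"rho (1 - S^2/sigma^2) = {math.sqrt(rho_sq_ratio):.3f}")
```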

What does this value measure? We have seen that r
and the more general index $\rho$ are abstract measures of the
degree of relationship between two variables, as this relationship
is described by given functions. The value of $\rho$
in a given case depends upon the variability about the
fitted line, in relation to the variability about the mean
of the Y's. If the variability of estimates is materially
reduced when the equation of regression is used as a basis
for estimates, instead of the mean Y, the equation may be
assumed to describe a significant relationship. The value
of $\rho$ depends thus upon the relation between the two
quantities, $S_y$ and $\sigma_y$.

In the cases dealt with in the preceding chapter the
variability in each case was measured in terms of absolute
deviations, and the value of $\rho$ depended upon the relation
between the two given measures of absolute variability.
The sole difference in the present case is that we are working
in terms of logarithmic or ratio variability, deviations being
measured in terms of logarithms instead of natural numbers.

The index $\rho$ must be interpreted in the light of this fact.
Its value, as always, depends upon the relation between
two measures of variability, $S^2$ and $\sigma^2$, but in the present
instance these are expressed in terms of logarithms. In
brief, the value of $\rho$ depends upon the relation between
the ratio variability about the fitted curve and the ratio
variability about the geometric mean of the Y's. (It is
the geometric mean of the Y's, because that is the value
corresponding to the arithmetic mean of the Y logarithms.)

We have here a set of measures, therefore, which perform
in the field of ratios precisely the same service as is performed
in the field of natural numbers by S and $\rho$ (in the
linear case, r). These measures are secured in the same
way as are S and $\rho$, except that the equation of relationship
from which they are derived is one in which the dependent
variable is log Y (or, in the reverse case, log X). The general
formulas for computing these values are the same as in
dealing with natural numbers, except that log Y replaces
Y throughout. The operation is analogous to that of using
logarithmic paper instead of natural scale paper.

It should be noted that the values are in logarithmic or
ratio form if Y is expressed logarithmically, whether X
be so expressed or not. Thus we have fitted a curve of
the type

$$\log Y = \log a + b \log X$$

the logarithmic form of the ordinary parabola or hyperbola.
The values S and $\rho$ would also be in logarithmic form if
the curve were of the type

$$\log Y = \log a + X \log b$$

the logarithmic form of the exponential

$$Y = a(b^X).$$

In each of these cases the logarithmic equation is linear,
but this is not essential to the use of these measures. S and
$\rho$ are generally applicable measures, whether ratios or natural
numbers be dealt with, and whether the functions be
linear or otherwise.

It may be well at this point to summarize the symbols
that have been used and to distinguish the different measures.
We may employ the symbols $S_y$, $\sigma_y$, and $\rho$ when
arithmetic relations are in question, the two former being
measures of variation in absolute terms, and the index $\rho$
referring to degree of relationship when natural numbers
are employed. If the logarithms of the Y's are used it is
advisable to distinguish the symbols by subscripts, using
$S_{\log y}$ and $\sigma_{\log y}$ as measures of the logarithmic variation
about the fitted curve and about the arithmetic mean
of the logarithms of the Y's, respectively. If $S_{\log y}$ is reduced
to ratio form, it may be written $S_r$. Since the index $\rho$
must be interpreted somewhat differently in this case, it
may be written $\rho_{\log y \cdot \log x}$ or $\rho_{\log y \cdot x}$.


The Use of Reciprocals in the Measurement of Relationship

Another type of curve may be used to describe the
relationship between the production and price of oats, and
its use introduces us to a third field of correlation, a field
in which somewhat new concepts enter, and in which the
various measures must be interpreted in still another way.
This is a curve of the type

$$Y = \frac{1}{a + bX}$$

which may be expanded by adding additional terms to
the denominator, as

$$Y = \frac{1}{a + bX + cX^2}.$$

This hyperbolic form has been used in several studies as
an approximation to a "demand" curve for various commodities.

The equation to a curve of this type may be written

$$\frac{1}{Y} = a + bX$$

which is the equation to a straight line describing the relationship
between the reciprocals of the Y's and the original
X values. The normal equations required in fitting a
curve of this type are

$$\text{I} \qquad \Sigma\left(\frac{1}{Y}\right) = Na + b\,\Sigma(X)$$

$$\text{II} \qquad \Sigma\left(\frac{X}{Y}\right) = a\,\Sigma(X) + b\,\Sigma(X^2).$$

The method of computing the necessary values is illustrated
in Table 134.

Substituting the proper values in the normal equations,
we have

$$34.3360320 = 33a + 33.338b$$
$$35.2571485 = 33.338a + 34.168554b.$$

Solving,

$$a = -.1357$$
$$b = 1.1643.$$
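
The solution of the two normal equations may be verified with a short Python sketch. The totals are those of Table 134; the function name and the use of Cramer's rule are assumptions of the sketch, not a statement of how the original computation was carried out.

```python
N = 33
SUM_X = 33.338             # sum of X
SUM_X_SQ = 34.168554       # sum of X^2
SUM_RECIP_Y = 34.3360320   # sum of 1/Y
SUM_X_OVER_Y = 35.2571485  # sum of X/Y


def fit_reciprocal_line(n, sum_x, sum_x_sq, sum_r, sum_xr):
    """Solve  n*a + sum_x*b = sum_r  and  sum_x*a + sum_x_sq*b = sum_xr."""
    det = n * sum_x_sq - sum_x ** 2
    a = (sum_r * sum_x_sq - sum_x * sum_xr) / det
    b = (n * sum_xr - sum_x * sum_r) / det
    return a, b


a, b = fit_reciprocal_line(N, SUM_X, SUM_X_SQ, SUM_RECIP_Y, SUM_X_OVER_Y)
print(f"a = {a:.4f}, b = {b:.4f}")
```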



Table 134

Computation of Values Required in Fitting a Curve to Data of Oat
Production and Prices

Example II

(1)    (2)     (3)         (4)         (5)         (6)          (7)
       Price   Production
Year   Ratio   Ratio       1/Y         X/Y         (1/Y)^2      X^2
       Y       X

1881   1.30    .929        .7692308    .7146154    .59171602    .863041
1882   1.05    1.036       .9523810    .9866667    .90702957    1.073296
1883   .90     1.156       1.1111111   1.2844444   1.23456788   1.336336
1884   .85     1.128       1.1764706   1.3270588   1.38408307   1.272384
1885   .84     1.165       1.1904762   1.3869048   1.41723358   1.357225
1886   .77     1.108       1.2987013   1.4389610   1.68662507   1.227664
1887   .96     1.124       1.0416667   1.1708334   1.08506951   1.263376
1888   .79     1.151       1.2658228   1.4569620   1.60230736   1.324801
1889   .81     1.188       1.2345679   1.4666667   1.52415790   1.411344
1890   1.48    .798        .6756757    .5391892    .45653765    .636804
1891   1.10    1.088       .9090909    .9890909    .82644626    1.183744
1892   1.09    .943        .9174312    .8651376    .84168001    .889249
1893   1.16    .882        .8620690    .7603449    .74316296    .777924
1894   1.07    .886        .9345794    .8280373    .87343865    .784996
1895   .75     1.070       1.3333333   1.4266666   1.77777769   1.144900
1896   .76     .983        1.3157895   1.2934211   1.73130201   .966289
1897   .96     .969        1.0416667   1.0093750   1.08506951   .938961
1898   .95     1.005       1.0526316   1.0578948   1.10803329   1.010025
1899   .83     1.074       1.2048193   1.2939759   1.45158955   1.153476
1900   .86     1.033       1.1627907   1.2011628   1.35208221   1.067089
1901   1.37    .857        .7299270    .6255480    .53279343    .734449
1902   1.03    1.131       .9708738    1.0980583   .94259594    1.279161
1903   1.14    .911        .8771930    .7991228    .76946756    .829921
1904   .86     1.033       1.1627907   1.2011628   1.35208221   1.067089
1905   .86     1.090       1.1627907   1.2674419   1.35208221   1.188100
1906   1.04    1.013       .9615385    .9740385    .92455629    1.026169
1907   1.31    .770        .7633588    .5877863    .58271666    .592900
1908   1.29    .796        .7751938    .6170543    .60092543    .633616
1909   1.03    .978        .9708738    .9495146    .94259594    .956484
1910   .81     1.064       1.2345679   1.3135802   1.52415790   1.132096
1911   1.14    .810        .8771930    .7105263    .76946756    .656100
1912   .80     1.221       1.2500000   1.5262500   1.56250000   1.490841
1913   .87     .948        1.1494253   1.0896552   1.32117852   .898704

Total  32.83   33.338      34.3360320  35.2571485  36.85702940  34.168554

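The desired equation is, therefore,

$$\frac{1}{Y} = -.1357 + 1.1643X.$$

THE STANDARD ERROR AND THE INDEX OF CORRELATION IN
TERMS OF RECIPROCALS

To determine the utility of this equation we must have
the standard error and the index of correlation. The two
necessary formulas may be derived as in the preceding
cases. Representing by 1/Y the reciprocal of an actual value
we have, for each residual,

$$d = a + bX - \frac{1}{Y}. \qquad (1)$$

Multiplying by d and summing,

$$\Sigma(d^2) = a\,\Sigma(d) + b\,\Sigma(dX) - \Sigma\left(\frac{d}{Y}\right).$$

Since

$$\Sigma(d) = 0 \text{ and } \Sigma(dX) = 0,$$

we have

$$\Sigma(d^2) = -\Sigma\left(\frac{d}{Y}\right). \qquad (2)$$

Multiplying the residual equation (1) now by 1/Y and summing,
we have

$$\Sigma\left(\frac{d}{Y}\right) = a\,\Sigma\left(\frac{1}{Y}\right) + b\,\Sigma\left(\frac{X}{Y}\right) - \Sigma\left(\frac{1}{Y}\right)^2.$$

Substituting the equivalent of $\Sigma(d/Y)$ in the preceding
equation (2), we secure

$$\Sigma(d^2) = \Sigma\left(\frac{1}{Y}\right)^2 - a\,\Sigma\left(\frac{1}{Y}\right) - b\,\Sigma\left(\frac{X}{Y}\right)$$

and for $S^2_{1/y}$, we have

$$S^2_{1/y} = \frac{\Sigma\left(\frac{1}{Y}\right)^2 - a\,\Sigma\left(\frac{1}{Y}\right) - b\,\Sigma\left(\frac{X}{Y}\right)}{N}.$$

Inserting this value of $S^2_{1/y}$ in the general formula for the
index of correlation

$$\rho^2_{1/y} = 1 - \frac{S^2_{1/y}}{\sigma^2_{1/y}}$$

and simplifying, we have

$$\rho^2_{1/y} = \frac{a\,\Sigma\left(\frac{1}{Y}\right) + b\,\Sigma\left(\frac{X}{Y}\right) - N c^2_{1/y}}{\Sigma\left(\frac{1}{Y}\right)^2 - N c^2_{1/y}}$$

where $c_{1/y}$ is the arithmetic mean of the reciprocals of the
Y-values, corresponding to $c_{\log y}$ in the logarithmic case.

Inserting the proper values in these two equations, we find
that

$$S_{1/y} = .1191$$
$$\rho_{1/y} = .766.$$

For the standard deviation of the original Y-values, in
terms of reciprocals, we secure

$$\sigma_{1/y} = .1851.$$

(The subscript 1/y is used in connection with each of these
measures, as they should be distinguished from measures
based upon natural numbers or logarithms.)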
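
These three measures may be recomputed from the totals of Table 134 by the following Python sketch. The variable names are hypothetical. The constants a and b are re-solved at full precision within the sketch, an assumption made because the four-place values -.1357 and 1.1643 are too coarse for the subtraction involved.

```python
import math

N = 33
SUM_X = 33.338           # sum of X
SUM_X_SQ = 34.168554     # sum of X^2
SUM_R = 34.3360320       # sum of 1/Y
SUM_XR = 35.2571485      # sum of X/Y
SUM_R_SQ = 36.85702940   # sum of (1/Y)^2

# Constants of the fitted line, solved from the normal equations.
det = N * SUM_X_SQ - SUM_X ** 2
a = (SUM_R * SUM_X_SQ - SUM_X * SUM_XR) / det
b = (N * SUM_XR - SUM_X * SUM_R) / det

# Standard error of estimate, in terms of reciprocals.
s_sq = (SUM_R_SQ - a * SUM_R - b * SUM_XR) / N
s_recip = math.sqrt(s_sq)

# Standard deviation of the reciprocals about their arithmetic mean.
mean_recip = SUM_R / N
sigma_sq = SUM_R_SQ / N - mean_recip ** 2
sigma_recip = math.sqrt(sigma_sq)

# Index of correlation.
rho_recip = math.sqrt(1 - s_sq / sigma_sq)

print(f"S = {s_recip:.4f}, sigma = {sigma_recip:.4f}, rho = {rho_recip:.3f}")
```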

INTERPRETATION OF THE STANDARD ERROR OF ESTIMATE

How may we interpret these results? As in all former
problems of this type the equation gives us a means of
estimating Y from a known value of X. The standard
error $S_{1/y}$ serves as a measure of the reliability of such
estimates, and $\rho_{1/y}$ is an abstract measure of the degree of
relationship between the two variables. But in the present
case all these measures are in terms of reciprocals. The
equation enables us to estimate the reciprocal of Y, the
standard error has significance only in the form of a reciprocal,
and the value of $\rho$ depends upon the relation between
two measures ($S^2_{1/y}$ and $\sigma^2_{1/y}$) both of which are in terms of
reciprocals.

An illustration may make these meanings clear. If, in
a given year, the production of oats is 150 per cent of
trend, what is the most probable price? Substituting in
the equation

$$\frac{1}{Y} = -.1357 + 1.1643X$$

a value of 1.50 for X, we have

$$\frac{1}{Y} = 1.6108$$
$$Y = .621.$$

We may expect a price approximately 62 per cent of trend.
As a measure of the reliability of this estimate, we have

$$S_{1/y} = .1191.$$

This must be applied to the estimate in terms of reciprocals.
Thus we have

$$1.6108 + .1191 = 1.7299$$
$$1.6108 - .1191 = 1.4917.$$

Reducing these reciprocals to natural numbers we secure
.578 and .670 as the desired values. The most probable
price, then, is 62.1 per cent of trend, and, on the assumption
of an approximately normal distribution of reciprocals
about the curve, the odds are 68 out of 100 that the price
will fall between 57.8 per cent of trend and 67.0 per cent
of trend. The limits of 2S and 3S may be similarly determined
by adding to and subtracting from the estimate,
as a reciprocal, amounts equal to twice .1191 and three
times .1191. The results secured may then be converted
to natural numbers. Just as with logarithms, the value
in absolute terms of a given difference between reciprocals
varies at different points within the range of Y-values.
Accordingly, the limits of reliability determined from $S_{1/y}$
should be expressed in natural numbers only after a particular
estimate has been made.
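
The steps of this illustration may be sketched in Python as follows. The constants are those of the fitted equation and standard error given in the text; the function names are illustrative assumptions.

```python
A = -0.1357        # constant of the fitted reciprocal equation
B = 1.1643         # slope of the fitted reciprocal equation
S_RECIP = 0.1191   # standard error of estimate, in reciprocals


def estimate_reciprocal(x):
    """Estimated reciprocal of the price ratio for a production ratio x."""
    return A + B * x


def natural_limits(recip_estimate, k=1, s=S_RECIP):
    """Convert recip_estimate +/- k*s to natural numbers: (lower, upper)."""
    return 1 / (recip_estimate + k * s), 1 / (recip_estimate - k * s)


recip = estimate_reciprocal(1.50)
print(f"1/Y = {recip:.4f}, Y = {1 / recip:.3f}")
for k in (1, 2, 3):
    lower, upper = natural_limits(recip, k)
    print(f"{k}S: {lower:.3f} to {upper:.3f}")
```

Because the larger reciprocal gives the smaller natural number, the upper limit in reciprocals becomes the lower limit of price.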

A COMPARISON OF MEASURES OF RELATIONSHIP

In interpreting $\rho$ similar considerations enter. The value
of the index of correlation, as we have seen, depends upon
the degree of variation about the curve, as compared with
the variation about the average of the original dependent
series. In handling natural numbers, variability about the
fitted line is compared with the variability about the
arithmetic mean of the dependent variable, both measured
in absolute terms (i.e., $S_y$ is compared with $\sigma_y$). In handling
logarithms, variability about the fitted line is compared
with variability about the arithmetic mean of the logarithms
of the dependent series, variability being measured
in each case in terms of logarithms. But logarithmic
deviations, as we have seen, may be interpreted in terms
of ratios. The logarithmic deviations from the line represent
the ratios of actual values to computed, while logarithmic
deviations about the arithmetic mean of the logarithms of
the original series represent the ratios of the actual values
of the dependent series to their geometric mean. The value
of $\rho_{\log y}$ depends upon the relation between these respective
deviations (i.e., $S_{\log y}$ is compared with $\sigma_{\log y}$).

In fitting a curve in which the reciprocals of the dependent
variable are employed, variability about the fitted line is
measured in terms of reciprocals, and the variability of
the original series is measured in the same terms. That
is, $\sigma_{1/y}$ is computed from the differences between the reciprocals
of the actual values and the arithmetic mean of all
these reciprocals. But the arithmetic mean of these reciprocals
is the reciprocal of the harmonic mean. Thus, in
short, the value of the index of correlation, $\rho_{1/y}$, depends upon
the relation between variability about the fitted line and
variability about the harmonic mean of the dependent series,
variation in both cases being measured in terms of reciprocals
(i.e., $S_{1/y}$ is compared with $\sigma_{1/y}$).
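
The correspondence between the transformed scales and the three averages may be checked numerically. The sketch below, in Python, uses the first five price ratios of Table 134 purely as sample figures; `geometric_mean` and `harmonic_mean` are functions of the Python standard library (`statistics`, version 3.8 or later).

```python
import math
from statistics import mean, geometric_mean, harmonic_mean

# First five price ratios of Table 134, used only as sample figures.
prices = [1.30, 1.05, 0.90, 0.85, 0.84]

# Anti-logarithm of the arithmetic mean of the logarithms.
from_logs = 10 ** mean(math.log10(y) for y in prices)

# Reciprocal of the arithmetic mean of the reciprocals.
from_reciprocals = 1 / mean(1 / y for y in prices)

print(f"arithmetic mean: {mean(prices):.4f}")
print(f"geometric mean:  {geometric_mean(prices):.4f} (from logs: {from_logs:.4f})")
print(f"harmonic mean:   {harmonic_mean(prices):.4f} (from reciprocals: {from_reciprocals:.4f})")
```

For any set of unequal positive values the harmonic mean lies below the geometric, and the geometric below the arithmetic.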

We have, therefore, three broad families of curves for
describing the relationship between variable quantities.
These are:

1. Curves in the fitting of which natural values of the dependent
variable are employed. Equations to all curves of this family
will be of the type

$$Y = f(X).$$

2. Curves in the fitting of which logarithms of the dependent
variable are employed. In all such cases the equations will be
of the type

$$\log Y = f(X).$$

3. Curves in the fitting of which reciprocals of the dependent
variable are employed. For these curves the equations will
be of the type

$$\frac{1}{Y} = f(X).$$

In any one of these three cases the equations may be
linear or non-linear. In so far as this problem of interpretation
is concerned, there is no limitation as to the function
of X which may be employed. (The computation of S
and $\rho$ by the methods suggested above involves certain
limitations, which are outlined elsewhere.)

The standard error of estimate for the first family of
curves is derived in terms of the original units of measurement
(for the dependent variable) and has a direct and
simple meaning in these terms. The index of correlation,
for curves of this type, is a measure of the degree to which
the absolute variability of the dependent variable may be
lessened by measuring deviations from the fitted curve
instead of from the arithmetic mean.

The standard error of estimate for the second family of
curves is derived, by the method outlined, in terms of
logarithms. It is more convenient in general to give it meaning
in terms of ratios. The index of correlation, $\rho_{\log y \cdot \log x}$,
is a measure of the degree to which the logarithmic or
ratio variability of the dependent variable may be lessened
by computing deviations (or ratios) with the fitted curve
instead of the geometric mean as base.

The standard error of estimate for the third family of
curves is derived by the same process as in the other cases,
but emerges as a reciprocal. The index of correlation, $\rho_{1/y}$,
is a measure of the degree to which the variability of the
dependent variable, in terms of reciprocals, may be lessened
by computing reciprocal deviations from the fitted curve
instead of from the harmonic mean.

FACTORS GOVERNING THE CHOICE OF MEASURES OF
RELATIONSHIP

It is clear, therefore, that the choice of a type of curve
to describe a given relationship must be governed by basic
considerations as to the type of average which is most
appropriate as a measure of the central tendency of the
given series. And this brings in a related question as to
whether the dispersion about this average more nearly
approximates the normal type when measured in absolute
terms, in logarithms, or in reciprocals. In selecting a
curve and in using the measures S and $\rho$ there is always
present an implicit assumption with respect to these points.

When absolute values are important, and the dispersion
of the dependent variable approaches the normal type when
plotted on an arithmetic scale, measures of relationship of
the arithmetic type would appear to be appropriate. But,
as we have seen, in handling series in which rates of change
rather than absolute amounts of change are of primary
importance and the dispersion appears to follow a geometric
law, the arithmetic mean and other arithmetic measures
are notoriously inadequate. In such cases logarithmic curves
seem preferable to arithmetic, and measures of the reliability
of estimates and of degree of relationship which are based
upon ratios seem to be more suitable than those based upon
absolute values.

The harmonic mean has not been so widely employed as
either of the above averages, and some attention may be
given to principles governing its use in problems of the
type here considered. In general, such harmonic measures
are marked by the same weaknesses as the arithmetic,
except that they err in the opposite direction. Geometric
measures are perhaps better adapted to all-around employment
than either. Yet in one particular field of interest
to the economist the harmonic mean is particularly appropriate,
and the utilization of reciprocals, as in the preceding
example, seems to be justified.

The use of the harmonic mean assumes a normal distribution
of reciprocals which, in natural numbers, means a
much wider scatter above the average than below. The
use of a curve of the type

$$\frac{1}{Y} = a + bX$$

involves a similar assumption as to the relation between
Y and X. A given absolute increase in X will be accompanied
by a certain decrease in the value of Y. The same
absolute decrease in X will be accompanied by an increase
in the value of Y which is larger than the decrease registered
in the preceding case. But this is the relation which prevails,
for many commodities, between the amounts produced and
the price, the latter considered dependent. A given increase
in production will cause some lowering of price. An equal
decrease will cause a much greater increase in price.
Moreover, when averaging the prices of such commodities
over a period, the harmonic mean may give a more typical
value than any other average.¹ In such cases there is a
strong a priori justification for using a curve of the reciprocal
type and measuring the accuracy of all estimates in terms
of harmonic relations.
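
The unequal response of price to equal increases and decreases in production may be illustrated with the reciprocal curve fitted to the oats data. In the Python sketch below the constants are those of that curve; the starting point (production at trend) and the change of .2 are hypothetical figures chosen for illustration.

```python
A = -0.1357   # constant of the fitted reciprocal equation
B = 1.1643    # slope of the fitted reciprocal equation


def price_ratio(x):
    """Price ratio given by the curve 1/Y = A + B*X."""
    return 1 / (A + B * x)


base = price_ratio(1.0)
fall = base - price_ratio(1.2)   # effect of raising production by .2
rise = price_ratio(0.8) - base   # effect of lowering production by .2

print(f"price at trend production: {base:.3f}")
print(f"fall in price for an increase of .2 in production: {fall:.3f}")
print(f"rise in price for a decrease of .2 in production: {rise:.3f}")
```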


¹ "Buyers and sellers of potatoes are frequently mistaken as to the price
justified by fundamental economic conditions. If such an error is general in the
fall, it may happen, for example, that the price which results is too high. If the
price is too high in the early part of the season, potatoes will not be consumed
fast enough to dispose of the supply available. Farmers and dealers will then
find that not all of the stocks on hand can be sold at existing prices. Since
potatoes can not be carried over from one year to the next, the price, under
such conditions as have been mentioned, must be lowered enough to permit
the supply to be disposed of before the end of the season. A properly adjusted
price would remain the same throughout the season, except for a gradual advance
to cover cost of storage, and would maintain a fairly uniform consumption
throughout the season. But since an abnormally high price early in the
season causes small consumption, it must be compensated by an abnormally
low price during the remainder of the season, or not all the crop can be sold.

"Similarly, if the price is abnormally low early in the season, the supply will
be exhausted too rapidly and those who still have potatoes will find that they
can get abnormally high prices for them during the remainder of the season."

But how, given the abnormally high or abnormally low prices during part of
a season, may we compute the average price which would be justified by the
true conditions of demand and supply, if these had been correctly estimated?
Since "a low price during part of a season will be compensated only by a
disproportionately high price during the remainder of the season" the arithmetic
average for an entire season "will be somewhat higher than the average which
would have resulted had a proper price been established at the beginning of the
season. This difficulty is eliminated by taking the harmonic mean of the monthly
prices."

Holbrook Working, Factors Determining the Price of Potatoes in St. Paul and
Minneapolis, Technical Bulletin 10, University of Minnesota Agricultural
Experiment Station, 8-10.

ARITHMETIC, GEOMETRIC, AND HARMONIC MEASURES

The contrast between these different methods may be
brought home most effectively by comparing the results
obtained when curves of these three types are fitted to the
same data. The computations involved in fitting curves
of the second and third types (logarithmic and reciprocal)
have been illustrated with reference to the data of oat
production and prices (Table 132). A straight line (arithmetic)
is fitted to the same data, and the necessary accompanying
measures computed. The three sets of results are
brought together in Table 135.


Table 135

Relation between the Production and Price of Oats, 1881-1913
Comparison of Results of Curve Fitting
(Prices are the dependent variable in each case)

     Equation                               Standard error         Index of
                                            of estimate            correlation

A    Y = 2.24 - 1.236X                      S_y = .12              ρ = .783

B    1/Y = -.1357 + 1.1643X                 S_(1/y) = .1191        ρ = .766

C    log Y = -.00861 - 1.18206 log X        S_(log y) = .04901     ρ = .793


It is impossible to compare the three standard errors as
they stand, since only the first one is in the original units
of measurement (ratio of actual price to trend). In the
following table are given estimates, based on each of these
equations, as to the most probable price (in terms of ratio
to trend) which would accompany each of five different
conditions of production.¹ Each estimate is accompanied
by a series of values which indicate the limits set by the
standard error. Throughout, the values of the estimates
plus and minus S, 2S, and 3S are given, in order to indicate
the probable scatter of actual values about the estimates.
The different amounts of variation which may be expected
about each of the three lines of relationship are measured
by the actual differences between the estimates and the
limiting cases. These differences are given in the columns
headed Δ. All values in this table are comparable, being
reduced to the original units (ratio of actual price to trend).

¹ For the purpose of this illustration the limits of actual observation have
been exceeded in setting up Table 136. Such extrapolation involves the
possibility of errors of another sort. With these we are not here concerned.

Table 136

Comparison of Price Estimates and of Standard Errors of Estimate
Based on Three Equations Relating to the Production and
Price of Oats

X is the ratio of production to normal; Y is the ratio of price to trend.
For each value of X the rows give the limits of estimate at plus and minus
S, 2S, and 3S, with the estimate itself in the middle row. Δ is the
difference between the limit and the estimate.

                  Arithmetic          Reciprocal            Logarithmic
                  equation (A)        equation (B)          equation (C)
  X    Row          Y       Δ           Y        Δ            Y       Δ

 0.5   +3S        1.982   +.36       11.228   +8.988        3.114   +.890
       +2S        1.862   +.24        4.803   +2.563        2.780   +.556
       +S         1.742   +.12        3.056    +.816        2.491   +.267
       Estimate   1.622               2.240                 2.224
       -S         1.502   -.12        1.768    -.472        1.979   -.245
       -2S        1.382   -.24        1.461    -.779        1.779   -.445
       -3S        1.262   -.36        1.244    -.996        1.579   -.645

 0.8   +3S        1.611   +.36        2.281   +1.024        1.786   +.510
       +2S        1.491   +.24        1.794    +.537        1.595   +.319
       +S         1.371   +.12        1.478    +.221        1.429   +.153
       Estimate   1.251               1.257                 1.276
       -S         1.131   -.12        1.093    -.164        1.136   -.140
       -2S        1.011   -.24         .967    -.290        1.021   -.255
       -3S         .891   -.36         .867    -.390         .906   -.370

 1.0   +3S        1.364   +.36        1.490    +.518        1.372   +.392
       +2S        1.244   +.24        1.265    +.293        1.226   +.246
       +S         1.124   +.12        1.100    +.128        1.098   +.118
       Estimate   1.004                .972                  .980
       -S          .884   -.12         .871    -.101         .872   -.108
       -2S         .764   -.24         .789    -.183         .784   -.196
       -3S         .644   -.36         .722    -.250         .696   -.284

 1.2   +3S        1.117   +.36        1.106    +.313        1.106   +.316
       +2S         .997   +.24         .977    +.184         .987   +.197
       +S          .877   +.12         .875    +.082         .885   +.095
       Estimate    .757                .793                  .790
       -S          .637   -.12         .724    -.069         .703   -.087
       -2S         .517   -.24         .667    -.126         .632   -.158
       -3S         .397   -.36         .618    -.175         .561   -.229

 1.5   +3S         .746   +.36         .798    +.177         .852   +.245
       +2S         .626   +.24         .728    +.107         .761   +.154
       +S          .506   +.12         .670    +.049         .680   +.073
       Estimate    .386                .621                  .607
       -S          .266   -.12         .578    -.043         .542   -.065
       -2S         .146   -.24         .541    -.080         .484   -.123
       -3S         .026   -.36         .508    -.113         .433   -.174

Zones of Estimate and Their Significance

A careful study of this table should make clear the nature
of estimates based on the three types of equations here
presented. The fundamental differences lie not so much
in the actual values of the estimates, as in the standard
errors which measure the reliability of these estimates and
indicate the limits within which the actual values are likely
to fall. In other words, the differences lie in the assumptions
made as to the character of the scatter about the curves.

Fig. 87. The Relation between the Production and Price of Oats:
Illustrating the Use of an Arithmetic Equation of Regression and
Arithmetic Zones of Estimate (Horizontal axis: Production Ratio)

The measure $S_y$, which relates to the arithmetic curve,
gives the same absolute range to errors of estimate whether
the estimated value be high or low. An arithmetic dispersion
about the curve is assumed. In each case the estimate
is the arithmetic mean of the value which exceeds the
estimate by an amount equal to $S_y$ (or any multiple of $S_y$)
and the value which falls below it by an equal amount.
These conditions are brought out graphically in Fig. 87.
The original points are plotted, the straight line of relationship
(arithmetic) is shown, and zones of estimate having
widths, respectively, of 2S, 4S, and 6S, centering at the
fitted line, are marked out.

Fig. 88. The Relation between the Production and Price of Oats:
Illustrating the Use of a Logarithmic Equation of Regression and
Geometric Zones of Estimate

Fig. 89. The Relation between the Production and Price of Oats:
Illustrating the Use of a Logarithmic Equation of Regression and
Geometric Zones of Estimate (Plotted on Double Logarithmic Paper)

The measure $S_{\log y}$ gives the same relative or percentage
range to errors of estimate, whether the estimate be high
or low. This means that the absolute range within which
the actual values should fall is much less when the estimates
are low than when they are high. It assumes a geometric
dispersion about the curve which describes the relationship.
The estimate is, in this case, the geometric mean of the
value which exceeds it by an amount equal to $S_{\log y}$ (or
any multiple of $S_{\log y}$) and the value which falls below it
by an equal amount. Fig. 88 presents these relationships
graphically. The original data are here plotted, together
with the graph of the equation

$$Y = .9804X^{-1.18206}.$$

There are shown, also, the limits of zones of estimate having
widths equal, respectively, to $2S_r$, $4S_r$, and $6S_r$, centering
(geometrically) at the line of relationship. A comparison
of Fig. 87 and Fig. 88 will reveal the differences between
estimates based on the assumption of an arithmetic distribution
and those based on the assumption of a geometric
distribution.

The points and lines shown in Fig. 88 are plotted on a
logarithmic scale in Fig. 89. On this scale the curve of
relationship becomes straight, and the zones of estimate
appear as symmetrical and of equal width throughout the
range. This transformation when the data are plotted on
logarithmic paper makes clear the fundamental simplicity
of the assumptions involved in making estimates from
logarithmic values.

In using the measure $S_{1/y}$ we carry still further the assumption
that the variability about the curve is greater with
high prices than with low. It shows a very limited range
to errors of estimate when the estimate is low and a very
wide range when the estimated price is high. A harmonic
dispersion about the curve is assumed. The computed
value, or estimate, is always the harmonic mean of the
value which exceeds it by an amount equal to $S_{1/y}$ (or any
multiple of $S_{1/y}$) and the value which falls below it by an
equal amount.
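The three statements about means can be checked numerically. The sketch below is illustrative only and uses hypothetical helper names: with the constants of Table 135 it forms the limits at plus and minus k standard errors on each scale and confirms that the estimate is the arithmetic, geometric, or harmonic mean of its two limits, as the case may be.

```python
import math


def arithmetic_mean(lo, hi):
    return (lo + hi) / 2


def geometric_mean(lo, hi):
    return math.sqrt(lo * hi)


def harmonic_mean(lo, hi):
    return 2 / (1 / lo + 1 / hi)


# Standard errors of estimate from Table 135.
S_Y, S_LOG, S_RECIP = 0.12, 0.04901, 0.1191


def estimates_and_means(x=1.0, k=1):
    """Return (estimate, mean of its two limits) for each of the three equations."""
    # Arithmetic equation: limits are equally spaced in the original units.
    est_a = 2.24 - 1.236 * x
    mean_a = arithmetic_mean(est_a - k * S_Y, est_a + k * S_Y)

    # Logarithmic equation: limits are equally spaced in logarithms.
    log_est = -0.00861 - 1.18206 * math.log10(x)
    est_g = 10 ** log_est
    mean_g = geometric_mean(10 ** (log_est - k * S_LOG), 10 ** (log_est + k * S_LOG))

    # Reciprocal equation: limits are equally spaced in reciprocals.
    recip = -0.1357 + 1.1643 * x
    est_h = 1 / recip
    mean_h = harmonic_mean(1 / (recip + k * S_RECIP), 1 / (recip - k * S_RECIP))

    return (est_a, mean_a), (est_g, mean_g), (est_h, mean_h)


if __name__ == "__main__":
    for est, mean in estimates_and_means(1.0, 2):
        print(round(est, 4), round(mean, 4))
```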






Fig. 90. The Relation between the Production and Price of Oats:
Illustrating the Use of an Equation of Regression Based upon Reciprocals,
and of Harmonic Zones of Estimate

In Fig. 90 the curve

$$\frac{1}{Y} = -.1357 + 1.1643X$$

is plotted, together with the original observations. Zones of estimate
with widths of $2S_{1/y}$, $4S_{1/y}$, and $6S_{1/y}$, centering (harmonically)
at the fitted line, are shown. The differences between this
figure and each of the two preceding are quite marked,
particularly with respect to the zones of estimate. On the
assumption of a normal harmonic distribution about the
curve describing the relationship, the outer zone (with width
equal to $6S_{1/y}$) marks the limits within which 99.7 per cent
of all the points should fall, and the inner zone (with width
equal to $2S_{1/y}$) marks the limits within which 68 per cent
of all the points should fall. By plotting reciprocals throughout,
instead of natural numbers, this apparently abnormal
distribution could be reduced to the symmetrical form
secured in plotting the geometric values on the logarithmic
chart.

For both high and low estimates the geometric measure,
$S_{\log y}$, stands between the arithmetic measure, $S_y$, and the
harmonic measure, $S_{1/y}$. While the two latter have their
particular functions, and are appropriate in certain cases,
it is probably true that in using such methods as these in
economic analysis, measures of the geometric family are
more generally useful than those of the other types. This
means, merely, that ratios are usually more important
than absolute differences. It seems reasonable therefore
to base estimates upon an equation of the type

$$\log Y = f(X)$$

and to measure the reliability of these estimates in terms
of logarithms or ratios, using $S_{\log y}$ or $S_r$. In such cases, as
we have seen, correlation is measured by $\rho_{\log y \cdot \log x}$ or $\rho_{\log y \cdot x}$.
The value of this index depends upon the ratio variability
about the curve, as compared with the ratio variability
about the geometric mean.¹

¹ The reasoning in C. M. Walsh's book, The Problem of Estimation (London,
King, 1921, p. 12), is peculiarly applicable to the present problem. Citing
Galileo, in defence of the use of the geometric mean in averaging estimates, Walsh
writes: "And so errors must be measured by an error which is a ratio between
the estimate and the true quantity, and not a concrete quantity itself. We cannot
measure errors by so many pounds, feet or crowns; we must measure them
by the proportions of the pounds, feet or crowns in the erroneous estimates to
the pounds, feet or crowns in the thing estimated." (Italics mine.) This
argument bears out powerfully what has been said as to the use of logarithmic
functions in estimating, and as to the employment of logarithmic measures
of errors of estimate.



CHAPTER XVIII 


STATISTICAL INDUCTION AND THE PROBLEM 
OF SAMPLING, CONCLUDED 

The methods of induction discussed in an earlier section 
(Chapter XIV) dealt with the more familiar procedures 
employed in generalizing results secured from the study 
of samples. Certain research problems call for modifications 
of the methods there described, while for some purposes 
quite different instruments are needed. In the present
chapter, therefore, we carry forward the discussion of
statistical inference, considering methods appropriate to
certain special conditions and special problems.

Generalizing from Small Samples

The standard error of an arithmetic mean, we have seen,
is given by

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

where N is the number of observations in the sample and
$\sigma$ is the standard deviation of the population from which
the sample is drawn. We do not know the standard deviation
of the population but we approximate it from the standard
deviation of the sample. (For convenience in this exposition
we shall use s as a symbol for the standard deviation of
the sample; $\sigma$ will denote the standard deviation of the
population.) This is an acceptable approximation when
N is reasonably large, say 30 or more. But for small values
of N the standard deviation of the sample is subject to
a definite bias, tending to make it consistently lower than
the standard deviation of the population. The value of
$\sigma_M$ derived by the customary method is also biased downward.
Therefore, when methods appropriate to large
samples are employed with small samples, we consistently
under-estimate the sampling errors to which our measurements
are subject. This bias shows remarkable consistency,
however. With samples of any stated size the magnitude
of the error to be expected from the use of the standard deviation
of the sample as an approximation to the standard deviation
of the population may be determined, and correction
made for it. Accordingly, generalization of results secured
from small samples is possible. In the nature of things the
margin of error in such generalization is larger than it is when
large samples are used, but the distortion due to sheer
bias may be avoided.¹

The nature of the error involved in generalizing from
small samples may be brought out in the following terms.
If we represent by M the mean of the population from
which a sample is drawn, by $\bar{X}$ the mean of a single sample,
and by $\sigma_{\bar{x}}$ the standard deviation of a distribution of a
number of $\bar{X}$'s computed from successive samples, we may
write

$$T = \frac{\bar{X} - M}{\sigma_{\bar{x}}}$$

The quantity T is the deviation of the mean of the sample
from the mean of the population, expressed in units of the
standard deviation of the sample means. When $\sigma_{\bar{x}}$ is
determined from the actual distribution of a number of
$\bar{X}$'s, or from the true standard deviation of the population
and N of the sample, the quantity T may be interpreted
as a normal deviate. The significance of given values of
T may then be determined with reference to a table of
areas under the normal curve. Actually, we do not
have a large number of $\bar{X}$'s, which may be arranged in a
frequency distribution, nor do we know the value of $\sigma$
(the standard deviation of the population), nor of $\sigma_{\bar{x}}$ (the
standard error of $\bar{X}$). We approximate $\sigma$ by s (the standard
deviation of the sample) and $\sigma_{\bar{x}}$ by what we may call $s_{\bar{x}}$:

$$s_{\bar{x}} = \frac{s}{\sqrt{N-1}} \quad \text{if } s \text{ has been computed from } \sqrt{\frac{\Sigma d^2}{N}},$$

$$s_{\bar{x}} = \frac{s}{\sqrt{N}} \quad \text{if } s \text{ has been computed from } \sqrt{\frac{\Sigma d^2}{N-1}}.$$

When these approximations are based upon small samples, the T derived
from them may not be interpreted as a normal deviate. For
the distribution of T varies with the size of the sample. With
small samples the distribution departs significantly from the
normal type. Statistical inferences that fail to take account
of this are inaccurate.

¹ The bias involved in the use of s as an approximation to $\sigma$, for small
samples, was first discovered by "Student." For the original memoir see "The
Probable Error of the Mean," Biometrika, Vol. 6, 1908, 1-25.

A discussion in detail of the distributions of statistical 
measurements obtained from small samples would carry 
us beyond the scope of the present book. We may briefly 
note, however, certain characteristics of the distribution 
function of the standard deviation. These are effectively 
revealed by the results of an interesting experiment con- 
ducted by W. A. Shewhart. 

Shewhart drew 1,000 samples, each consisting of four
observations, from a normally distributed parent population
with a known standard deviation, equal to unity.¹ The
standard deviation, s, of each sample was computed. The
distribution of these thousand values of s is represented by
the dots in Fig. 91.² (The line running through the dots
defines the theoretical distribution of s's to be expected,
with samples of 4, on the basis of "Student's" theory.
There is a notably close agreement between the theoretical
and observed distributions.) Traditional sampling concepts
would lead us to expect a normal distribution of s's, centering
about 1, the value of $\sigma$ in the parent population.
Instead, the distribution is definitely skew, with the measurements
clustering about a central tendency well below
unity. The mode of the thousand values of s here represented
is, in fact, .717 and the arithmetic mean is .801.
These s's, it will be recalled, represent estimates of $\sigma$.
There is a clear tendency for such estimates, based on
samples of four, to understate the true value.

Fig. 91. Distribution of Standard Deviations in Samples of Four Drawn
from a Normal Universe (Horizontal axis: Standard Deviation)

¹ W. A. Shewhart, Economic Control of Quality of Manufactured Product,
New York, Van Nostrand, 1931, 163-173, 185-186.

² The figure is here reproduced with the permission of Dr. Shewhart and his
publishers.
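An experiment of Shewhart's kind is easily repeated by machine. The sketch below is an illustration under stated assumptions, not a reconstruction of his procedure: it assumes that s is computed with the divisor N, uses standard closed-form results for normal samples (they are not formulas quoted from the text), and the seed and function names are arbitrary choices.

```python
import math
import random
import statistics


def expected_s_ratio(n):
    """Mean of s as a fraction of sigma, for normal samples of size n,
    when s is computed with the divisor n (an assumption of this sketch)."""
    return math.sqrt(2.0 / n) * math.gamma(n / 2) / math.gamma((n - 1) / 2)


def modal_s_ratio(n):
    """Mode of s as a fraction of sigma under the same assumption."""
    return math.sqrt((n - 2) / n)


def simulate_s(n=4, trials=1000, seed=1):
    """Draw `trials` samples of n from a normal universe with sigma = 1
    and return the standard deviation of each sample."""
    rng = random.Random(seed)
    return [
        statistics.pstdev([rng.gauss(0, 1) for _ in range(n)])
        for _ in range(trials)
    ]


if __name__ == "__main__":
    s_values = simulate_s()
    print("simulated mean of s:", round(statistics.mean(s_values), 3))
    print("expected mean of s: ", round(expected_s_ratio(4), 3))
    print("expected mode of s: ", round(modal_s_ratio(4), 3))
```

For samples of four the expected mean is about .798 and the expected mode about .707, figures that may be set beside the observed .801 and .717.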

The symbol T has been used above to define the deviation 
of a statistical measure from some standard or hypothetical 
value, expressed in units of the estimated standard error 
of the measure in question, when the deviation, so expressed, 
could be interpreted as a normal deviate. In the present 
exposition we shall employ the symbol t to relate to approxi- 
mations to T when these approximations are based on 
small samples. 

The difference between T and t may be reduced to more
definite terms. If we let $x = \bar{X} - M$, we may write

$$T = \frac{x}{\sigma_{\bar{x}}}, \qquad t = \frac{x}{s_{\bar{x}}}.$$

We may derive t from T:

$$t = \frac{x}{\sigma_{\bar{x}}} \div \frac{s_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{x}{s_{\bar{x}}}.$$

The normally distributed quantity, $x/\sigma_{\bar{x}}$, has been divided
by the factor $s_{\bar{x}}/\sigma_{\bar{x}}$, to give the quantity t. Opportunity to
correct for the bias is given us, however, by the fact that
the distribution of $s_{\bar{x}}/\sigma_{\bar{x}}$ is known. Thus the probability
corresponding to any stated value of t may be determined
(when t defines a departure from a certain hypothetical
value, measured in units of $s_{\bar{x}}$).¹

It is of some interest to compare values of t corresponding 
to stated probabilities, for samples of varying sizes, with 
values of T corresponding to the same probabilities. This 
is done in Table 137. 

The familiar values given in the customary table of 
areas under the normal curve appear on the last line of 

¹ The degree of error involved in using s as an approximation to $\sigma$, for small
samples, is indicated by the following figures, taken from W. A. Shewhart
(loc. cit., 185). They define the relation between the modal s, for samples of
size N drawn from a population of which the standard deviation is known, and
the true $\sigma$ of that population.

    Size of sample      Modal s as a decimal
          N             fraction of true σ

          3                   .577
          4                   .707
          5                   .775
          6                   .817
          7                   .845
          8                   .866
          9                   .882
         10                   .894
         15                   .931
         20                   .949
         25                   .959
         30                   .966
         50                   .980
        100                   .990

The fractions given above define relations that are to be expected on the
basis of error theory, as modified by "Student" to take account of conditions
affecting small samples. The modal value of the 1,000 standard deviations
obtained by Shewhart in his empirical test of this theory was, as we have seen,
.717 of the standard deviation of the universe. This result is very close indeed
to the expected value of .707, for samples in which N = 4.

Table 137

Values of t and T Corresponding to Stated Probabilities ¹

                             Probability
   n      .80       .50       .40       .20       .10       .05       .01

   1     .325     1.000     1.376     3.078     6.314    12.706    63.657
   2     .289      .816     1.061     1.886     2.920     4.303     9.925
   3     .277      .765      .978     1.638     2.353     3.182     5.841
   4     .271      .741      .941     1.533     2.132     2.776     4.604
   5     .267      .727      .920     1.476     2.015     2.571     4.032
   6     .265      .718      .906     1.440     1.943     2.447     3.707
   7     .263      .711      .896     1.415     1.895     2.365     3.499
   8     .262      .706      .889     1.397     1.860     2.306     3.355
   9     .261      .703      .883     1.383     1.833     2.262     3.250
  10     .260      .700      .879     1.372     1.812     2.228     3.169
  20     .257      .687      .860     1.325     1.725     2.086     2.845
  30     .256      .683      .854     1.310     1.697     2.042     2.750
   ∞     .25335    .67449    .84162   1.28155   1.64485   1.95996   2.57582

Table 137, for n = ∞. These are the values of T, as a normal
deviate, corresponding to probabilities of .80, .50, etc.
Thus, when we are dealing with infinitely large samples,
the probability of a given sample yielding a value of T
as great as .25335 or greater (either above or below the
mean) is .80. (The area between the maximum ordinate
and an ordinate erected at +.25335 is 10 per cent of the
total area under the normal curve. Twenty per cent of
the total area will fall within ±.25335, and 80 per cent
will fall beyond these limits.) Similarly, just 50 per cent
of the values of T will exceed the limits ±.67449; 5 per
cent will exceed the limits ±1.95996; 1 per cent will
exceed the limits ±2.57582.

As n grows smaller each of these limits must be extended,
if the probabilities are to remain constant. For samples
in which n is equal to 10, 50 per cent of the values of t will
fall beyond the limits ±.700; 5 per cent will exceed the
limits ±2.228, and 1 per cent will exceed the limits
±3.169. (The letter n in Table 137 refers to the number
of degrees of freedom in the computation of t. This general
concept has been discussed in Chapter XV. When the
arithmetic mean of a sample is being tested for significance,
n = N - 1.) If in applying various statistical tests we attach
significance to a given level of probabilities, such as 5/100
or 1/100, we must recognize that the values of t corresponding
to these probabilities vary with n. Fortunately, we
now know how these values vary and, using such a table as
that given above, may make allowance for the variation.

¹ The entries in Table 137 are extracts from a more detailed table (Table IV)
in R. A. Fisher's Statistical Methods for Research Workers, Edinburgh, Oliver
and Boyd, sixth edition, 1936. The table is printed here through the courtesy
of Dr. Fisher and his publishers.

For convenience in exposition we have distinguished T, 
as a normal deviate, from t, a similar deviate relating to a 
distribution of quantities derived from small samples, and 
therefore not normal. The probabilities corresponding to 
a given value of T are not the same as the probabilities 
corresponding to an identical value of t. Indeed, these 
probabilities vary for the same value of t computed from 
samples of different sizes. The distinction between T and 
t need not be preserved, however. We may use t generally 
to define the deviation of a statistical measure from some 
standard or hypothetical value, expressed in units of the 
standard error of the measure in question. The quantity 
t is to be interpreted as a normal deviate when large samples
are dealt with. The interpretation is modified in dealing
with small samples, as we have seen. The nature of the 
modification required is shown by the entries in Table 137 
and in Appendix Table II. 
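The limits of Table 137 can be approximated numerically from the known form of the distribution of t. The sketch below is illustrative only: it integrates the density of t by Simpson's rule and finds each limit by bisection, methods chosen here for simplicity and not those by which the published table was prepared. The function names and the step size are assumptions, and the gamma function restricts the sketch to moderate values of n.

```python
import math


def t_density(x, n):
    """Density of t with n degrees of freedom."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)


def two_tail_probability(t, n, steps_per_unit=200):
    """Probability that t falls beyond the limits +t and -t,
    found by Simpson's rule over the interval from 0 to t."""
    t = abs(t)
    if t == 0:
        return 1.0
    m = max(2, 2 * math.ceil(t * steps_per_unit / 2))  # even number of intervals
    h = t / m
    total = t_density(0, n) + t_density(t, n)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * t_density(i * h, n)
    return 1 - 2 * (total * h / 3)


def t_limit(p, n, upper=100.0):
    """Value of t exceeded (in either direction) with probability p."""
    lo, hi = 0.0, upper
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tail_probability(mid, n) > p:
            lo = mid   # too much area beyond the limit: move outward
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for n in (5, 10, 30):
        print(n, [round(t_limit(p, n), 3) for p in (0.50, 0.05, 0.01)])
```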

Examples of Tests Based on the t-Table

In determining whether the mean of a sample deviates
significantly from any stated value we may compute t
from the relation

$$t = \frac{\bar{X} - M}{s_{\bar{x}}}$$

where $\bar{X}$ is the mean of the sample, M is the stated value
and $s_{\bar{x}}$ is an approximation to the standard error of $\bar{X}$.
For this approximation we have

$$s_{\bar{x}} = \frac{s}{\sqrt{N}}$$

where s is the standard deviation of the sample (here computed from

$$s = \sqrt{\frac{\Sigma d^2}{N - 1}},$$

d being the deviation of an observation from the mean of the
sample). The value t, which for larger samples
we have interpreted with reference to a table of areas
under the normal curve, we here interpret with reference
to the special t-table for small samples. In using the t-table
for this purpose we take n of that table as equal to N - 1.

For the six New England states the average earnings
of factory workers in 1935,¹ as indicated by census returns,
were as follows:

    Maine               $  851
    New Hampshire          892
    Vermont                940
    Massachusetts        1,007
    Rhode Island           938
    Connecticut          1,016

    Average             $  940.67

For s we obtain the figure $63.99. The standard error of
the mean is

$$s_{\bar{x}} = \frac{s}{\sqrt{N}} = \frac{\$63.99}{\sqrt{6}} = \$26.13.$$

Does the average of annual earnings of factory workers
in the six New England states differ significantly from
$1,022, the average for the country as a whole? Computing
t we have

$$t = \frac{\bar{X} - M}{s_{\bar{x}}} = \frac{\$940.67 - \$1,022}{\$26.13} = -3.11.$$

Consulting the t-table with n = 5 we find that for P = .01,
t = 4.032. The observed deviation is not as great as this.
If our standard is a P of .01, the average for the New
England states is not to be judged significantly less than
the average for the country as a whole. If the standard
were a P of .05, for which t = 2.571, the deviation would
be considered significant.

¹ These averages, and similar ones cited below, are derived by dividing the
total wages paid by the average number of wage-earners employed during the
year. Part-time workers are included. The averages do not represent full-time
earnings, therefore.
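The computation just described may be sketched in a few lines. The function name below is hypothetical and the code is offered only as an illustration of the steps; s is computed with the divisor N - 1, and because the work is carried to more decimal places than the hand computation, s comes out a few cents above the $63.99 of the text.

```python
import math

# Average earnings of factory workers, six New England states, 1935.
NEW_ENGLAND = [851, 892, 940, 1007, 938, 1016]


def one_sample_t(values, stated_value):
    """Return the mean, s, the standard error of the mean, and t."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    standard_error = s / math.sqrt(n)
    t = (mean - stated_value) / standard_error
    return mean, s, standard_error, t


if __name__ == "__main__":
    mean, s, se, t = one_sample_t(NEW_ENGLAND, 1022)
    print(f"mean = {mean:.2f}, s = {s:.2f}, standard error = {se:.2f}, t = {t:.2f}")
    # With n = N - 1 = 5, Table 137 gives 2.571 for P = .05 and 4.032 for P = .01.
    print("significant at .05:", abs(t) > 2.571)
    print("significant at .01:", abs(t) > 4.032)
```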

Similarly, we may test with reference to the t-table the
significance of a difference between two means, computed
from small samples. In this case we obtain t from the
relation

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s} \sqrt{\frac{N_1 N_2}{N_1 + N_2}}$$

where the $\bar{X}$'s and N's have the customary meanings, and
s is, in effect, an average standard deviation of the two
distributions. For s we have

$$s = \sqrt{\frac{\Sigma d_1^2 + \Sigma d_2^2}{N_1 + N_2 - 2}}$$

Here $d_1$ and $d_2$ are used, respectively, to denote deviations
of given observations from the means of the two distributions.
The value t, as derived above, corresponds to

$$t = \frac{D}{\sigma_D}$$

where D is the difference between two means and $\sigma_D$ is
the standard error of that difference. For small samples,
however, the customary formula for $\sigma_D$ is modified somewhat,
and the special t-table rather than the table of normal
deviates is used. In consulting the t-table in a problem of
this type, n is taken as equal to $N_1 + N_2 - 2$.

Average earnings of workers employed in manufacturing
plants in six Southern states, in 1935, are shown below:

    North Carolina      $662
    South Carolina       615
    Georgia              599
    Tennessee            744
    Alabama              640
    Mississippi          541

    Average             $633.50

Does this average differ significantly from the mean earnings
in six New England states in the same year? For the computation
of s we have

$$s = \sqrt{\frac{\Sigma d_1^2 + \Sigma d_2^2}{N_1 + N_2 - 2}} = \$66.06$$

and for t

$$t = \frac{\$940.67 - \$633.50}{\$66.06} \sqrt{\frac{36}{12}} = 8.05.$$

In the t-table, for n = 10, we find that the value of t corresponding
to a P of .01 is 3.169. The present value is clearly
significant. The two samples could not have come from
one homogeneous parent population.
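The test of a difference between two means may be sketched in the same manner. The code below is an illustration with hypothetical names; it pools the squared deviations of the two samples exactly as in the formula for s given above.

```python
import math

# Average earnings in manufacturing, 1935, as listed in the text.
NEW_ENGLAND = [851, 892, 940, 1007, 938, 1016]
SOUTHERN = [662, 615, 599, 744, 640, 541]


def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def two_sample_t(first, second):
    """Return the pooled s, t, and n (degrees of freedom) for two small samples."""
    n1, n2 = len(first), len(second)
    mean1, mean2 = sum(first) / n1, sum(second) / n2
    degrees = n1 + n2 - 2
    s = math.sqrt((sum_of_squares(first) + sum_of_squares(second)) / degrees)
    t = (mean1 - mean2) / s * math.sqrt(n1 * n2 / (n1 + n2))
    return s, t, degrees


if __name__ == "__main__":
    s, t, degrees = two_sample_t(NEW_ENGLAND, SOUTHERN)
    print(f"s = {s:.2f}, t = {t:.2f}, n = {degrees}")
    # Table 137 gives 3.169 for n = 10 and P = .01.
    print("significant at .01:", abs(t) > 3.169)
```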

The t-table has particular value in connection with the
interpretation of coefficients of regression. We may have
observed that a given variable, Y, appears to increase by
a constant increment or at a constant rate as another
variable, X, changes in value. The degree of relationship
between the two variables may be measured in terms of r,
the coefficient of correlation, but special interest often
attaches to the functional relationship and, in particular,
to the apparent regression of Y on X. Does b of the equation
of regression

Y = a + bX 

depart significantly from zero, or from some other value 
which has significance for the purpose in mind? Here we 
must judge b with reference to the sampling errors to which 
it is exposed. 

A general test of this type was applied in an earlier
section (Chapter XIV), in seeking to determine whether
average corn yield in Kansas had shown a significant decline
over the period 1890-1933. For smaller samples we may
compute t by exactly the methods there presented, but we
should interpret t with reference to the special t-table
adapted to small samples. As a general formula we have

$$t = \frac{b - \beta}{\sigma_b}$$

where b is a coefficient of regression and $\beta$ is a norm with
reference to which we wish to judge the given value of b.
For the standard error of b we have

$$\sigma_b = \frac{S_y}{\sqrt{\Sigma x^2}}$$

where

$$S_y = \sqrt{\frac{\Sigma (Y - Y_c)^2}{N - 2}}$$

(In these expressions, $x = X - \bar{X}$, Y is an observed value
of the dependent variable, and $Y_c$ is the corresponding
computed value.) In interpreting the value of t thus secured,
the t-table is employed with n = N - 2.

This test may be extended to the comparison of two 
coefficients of regression. The series in Table 138 provide 
an illustration. 


Table 138

Aggregate Values of Loans on Securities and Commercial Loans,
Reporting Member Banks, Federal Reserve System, 1922-1929
(In hundreds of millions of dollars)

    Year      Loans on        Commercial loans
              securities      ("all other loans")

    1922         39                 73
    1923         41                 78
    1924         45                 80
    1925         53                 82
    1926         57                 86
    1927         62                 87
    1928         69                 89
    1929         77                 92

For loans on securities the trend (i.e., the equation of
regression of volume of loans on time) is defined by

$$Y_1 = 30.63 + 5.49X_1.$$

The corresponding equation for commercial loans is

$$Y_2 = 72.13 + 2.54X_2.$$

In each case the origin is at 1921. The eight-year period
was marked by an increase of loans on securities which was
much more rapid than the corresponding advance in commercial
loans. We must ask, however, whether the difference
between the two coefficients of regression is really significant,
if account be taken of sampling fluctuations.

The coefficients to be compared are

$$b_1 = 5.49$$
$$b_2 = 2.54.$$

In testing whether $b_1 - b_2$ is significant (i.e., deviates
significantly from zero) we must compute

$$t = \frac{b_1 - b_2}{\sigma_{b_1 - b_2}}$$

$\sigma_{b_1 - b_2}$ being, of course, the standard error of the difference
between the two coefficients of regression. For this standard
error we have

$$\sigma_{b_1 - b_2} = \sqrt{S_y^2 \left( \frac{1}{\Sigma x_1^2} + \frac{1}{\Sigma x_2^2} \right)}$$

where $x_1$ and $x_2$ are given values of the two variables,
expressed as deviations from their respective arithmetic
means, and

$$S_y^2 = \frac{\Sigma (Y_1 - Y_{c_1})^2 + \Sigma (Y_2 - Y_{c_2})^2}{N_1 + N_2 - 4}$$

is a measure of the average scatter about the two lines
of regression.

In the present example we have

$$S_y^2 = 2.40$$

$$t = \frac{b_1 - b_2}{\sigma_{b_1 - b_2}} = \frac{5.49 - 2.54}{.338} = 8.73.$$

For the interpretation of this value of t we enter the t-table
with $n = N_1 + N_2 - 4 = 12$. In this case the value of t
far exceeds the value of 3.055, corresponding to P = .01.
The results are not consistent with the hypothesis that
the true value of $b_1 - b_2$ is zero. The trends of the two
series differ significantly. (Here, again, the reader should
bear in mind that such tests of significance apply only
with important qualifications to economic series that are
ordered in time.)
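The comparison of the two coefficients of regression can be traced step by step from the data of Table 138. The sketch below is illustrative, with hypothetical function names; time is measured from an origin at 1921, so that X runs from 1 to 8. Carried to full precision the computation gives a value of t slightly different from the 8.73 obtained with rounded figures.

```python
import math

YEARS = list(range(1, 9))                      # 1922-1929, origin at 1921
SECURITIES = [39, 41, 45, 53, 57, 62, 69, 77]  # loans on securities
COMMERCIAL = [73, 78, 80, 82, 86, 87, 89, 92]  # commercial loans


def fit_line(xs, ys):
    """Least-squares line; returns b, the sum of squares of x about its
    mean, and the sum of squared deviations of Y from the fitted line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sum_xy / sum_xx
    a = mean_y - b * mean_x
    scatter = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return b, sum_xx, scatter


def compare_slopes(xs1, ys1, xs2, ys2):
    """Return b1, b2, the pooled variance about the two lines,
    the standard error of b1 - b2, t, and n (degrees of freedom)."""
    b1, sum_xx1, scatter1 = fit_line(xs1, ys1)
    b2, sum_xx2, scatter2 = fit_line(xs2, ys2)
    degrees = len(xs1) + len(xs2) - 4
    pooled_variance = (scatter1 + scatter2) / degrees
    standard_error = math.sqrt(pooled_variance * (1 / sum_xx1 + 1 / sum_xx2))
    t = (b1 - b2) / standard_error
    return b1, b2, pooled_variance, standard_error, t, degrees


if __name__ == "__main__":
    b1, b2, var, se, t, degrees = compare_slopes(YEARS, SECURITIES, YEARS, COMMERCIAL)
    print(f"b1 = {b1:.2f}, b2 = {b2:.2f}, pooled variance = {var:.2f}")
    print(f"standard error = {se:.3f}, t = {t:.2f}, n = {degrees}")
```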

Sampling Errors of Coefficients of Correlation
Computed from Small Samples

As a general formula for the determination of the standard
error of the coefficient of correlation we have made use of

$$\sigma_r = \frac{1 - r^2}{\sqrt{N - 1}}.$$

In error theory, the r that appears in the numerator of
the right-hand member of this equation is the coefficient
of correlation in the universe from which the sample in
question is drawn. But this r is not known. Our best
approximation to it is the r derived from the sample. Here,
again, we face distortion in small samples, a distortion
that is the greater the higher the value of the true correlation.
The nature of this bias may be readily understood.
If we are drawing samples from a universe in which the
true value of r is +.95, the range of the possible variation
of the sample r's above the true r is only .05. But the
range of possible variation below the true value is 1.95
(i.e., from +.95 to -1.00). Accordingly, a distribution
of r's obtained from a great many small samples from this
universe will be sharply skew. An estimate of the true
value based upon a sample value will be subject to corresponding
bias. This bias will not be present when the
population value of r is zero. (The distribution of sample
r's when the population value of r is zero will be symmetrical,
but will depart somewhat from the normal type in other
respects.) It will not be pronounced when the samples
are large, even for high values of r. But when samples
are small and the population value of r departs materially
from zero, substantial inaccuracy results from the use of
the formula given above.

Allowance may be made for this bias by use of the table
showing the distribution of t, for samples of various sizes.
R. A. Fisher has shown that the procedure employed in
deriving t, in testing whether a coefficient of linear regression
differs significantly from zero, may be used, with an algebraic
modification of the mathematical expression, in determining
the significance of r. If we are testing the hypothesis that
a sample from which a given r has been computed was drawn
from a population in which the true value of r is zero, we
may compute t from the relation

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}.$$

This is equivalent, of course, to dividing the quantity r - 0
(i.e., the deviation of the given r from the hypothetical
value of zero) by $\sqrt{1 - r^2}/\sqrt{N - 2}$. In consulting the
t-table for the interpretation of the values thus obtained, n,
the number of degrees of freedom, is taken as equal to N - 2.

As an illustration, we may test the results obtained from
a study of the relation between the production and the
price of cotton in the United States, covering 35 observations.
The value of r is -.65. We have

$$t = \frac{-.65\sqrt{33}}{\sqrt{1 - (-.65)^2}} = -4.91.$$
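
The computation of t from r can be checked with a short sketch in Python. The function name is illustrative; the figures are those of the cotton example.

```python
import math

def t_from_r(r, n_pairs):
    """t ratio for testing a sample r against a true correlation of zero."""
    dof = n_pairs - 2  # degrees of freedom, N - 2
    return r * math.sqrt(dof) / math.sqrt(1.0 - r * r), dof

t, dof = t_from_r(-0.65, 35)
print(f"t = {t:.2f} with {dof} degrees of freedom")
```

The sign of t simply follows the sign of r. When the t-table is entered without regard to the direction of the deviation, only the magnitude of t is used.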


In consulting the t-table we find that for n = 33 the value
of t corresponding to a probability of 1 per cent¹ is approximately
2.73. If the true value of t were zero, a value as
great as 2.73 or greater would occur only 1 time out of
100, as a result of chance fluctuations of sampling. The
present value of t is, in absolute magnitude, substantially
greater than 2.73. It is highly improbable that it reflects a
chance drawing from a population in which the true value
of t (and, of course, of r) is zero. The results we have
obtained are not, then, consistent with the hypothesis that
the true value of r is zero. There appears to be a significant
negative correlation between the production and the price
of cotton.

¹ This probability refers to the likelihood of deviations above or below the
assumed true value of zero. It corresponds to the sum of areas at both extremities
of a frequency curve. We may divide it by two to obtain the probability
of a deviation of the stated magnitude in one direction only from the hypothetical
value. In most problems of the type here discussed it is conservative practice
to test given results with reference to the probability of a deviation of
given magnitude, without consideration of the direction of deviation. The
tabulated values of t lend themselves to this procedure.

Table 139

Values of the Correlation Coefficient for Different Levels of
Significance²

   n      P = .05      P = .02      P = .01

   1      .996917      .9995066     .9998766
   2      .95000       .98000       .990000
   3      .8783        .93433       .95873
   4      .8114        .8822        .91720
   5      .7545        .8329        .8745
   6      .7067        .7887        .8343
   7      .6664        .7498        .7977
   8      .6319        .7155        .7646
   9      .6021        .6851        .7348
  10      .5760        .6581        .7079
  11      .5529        .6339        .6835
  12      .5324        .6120        .6614
  13      .5139        .5923        .6411
  14      .4973        .5742        .6226
  15      .4821        .5577        .6055
  16      .4683        .5425        .5897
  17      .4555        .5285        .5751
  18      .4438        .5155        .5614
  19      .4329        .5034        .5487
  20      .4227        .4921        .5368
  25      .3809        .4451        .4869
  30      .3494        .4093        .4487
  35      .3246        .3810        .4182
  40      .3044        .3578        .3932
  45      .2875        .3384        .3721
  50      .2732        .3218        .3541
  60      .2500        .2948        .3248
  70      .2319        .2737        .3017
  80      .2172        .2565        .2830
  90      .2050        .2422        .2673
 100      .1946        .2301        .2540

² This table is printed here through the courtesy of R. A. Fisher and his
publisher, Oliver and Boyd, of Edinburgh. The original appears as Table V.A
of Statistical Methods for Research Workers.

If we are seeking to determine the significance of given 
coefficients of correlation with reference to hypothetical 
values of zero, use may be made of a table prepared by 
R. A. Fisher, showing the values of correlation coefficients 
at stated levels of significance. Selected values from this 
table are given in Table 139 and in Appendix Table III. In 
simple correlation problems, this is to be read with n equal 
to N — 2 (the number of pairs of original observations 
less 2). In determining the significance of coefficients of 
partial correlation the number of variables held constant 
is also subtracted from N. 

The use of the table requires little explanation. If a 
sample is based on 12 pairs of observations, with n equal 
to 10, we would require a coefficient at least as high as
.7079 before we accept it as significant, if our standard
of significance is P = .01. For only 1 time out of 100
trials would a sample of 12 drawn from an uncorrelated 
population yield a value of r as great as .7079. If our 
standard of significance is P = .05 we would accept as 
significant of a real relationship an r of .5760, or greater, 
obtained from a sample of 12. 
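
The entries of such a table follow from the t relation given earlier: solving $t = r\sqrt{n}/\sqrt{1 - r^2}$ for r gives $r = t/\sqrt{t^2 + n}$, with n the number of degrees of freedom. The sketch below, in Python, recovers the two entries for n = 10. The critical values of t (2.228 and 3.169) are not given in this section; they are taken from a standard two-sided t-table for 10 degrees of freedom, and the function name is illustrative.

```python
import math

def critical_r(t_critical, dof):
    """Smallest |r| judged significant, given the critical t for dof degrees of freedom."""
    return t_critical / math.sqrt(t_critical ** 2 + dof)

# Two-sided critical values of t for 10 degrees of freedom (standard t-table).
r_05 = critical_r(2.228, 10)
r_01 = critical_r(3.169, 10)
print(f"P = .05: r = {r_05:.4f};  P = .01: r = {r_01:.4f}")
```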


TRANSFORMATION OF r TO z

The sampling limitations attaching to r have led R. A. Fisher
to utilize as a general measure of linear correlation a logarithmic
function of r that possesses certain distinctive merits.¹
In effecting the transformation we have

$$z = \tfrac{1}{2}\{\log_e(1 + r) - \log_e(1 - r)\}.$$

Conversely

$$r = \frac{e^{2z} - 1}{e^{2z} + 1}.$$

The scales of possible values of r and z are, of course, quite
different. For r = 0, z = 0, and for r = 1, z = ∞. Negative
values of r give negative values of z. The relations
between the two functions, at different levels of correlation,
are shown by the entries in Appendix Table IV. Transformation
may be more readily effected by means of this
table than from the relations given above.

There are certain highly important advantages in this 
transformation. Not least is the replacing of r by a function 
with a distribution of values corresponding more closely 
to the true significance of observed correlations than do 
those of r. Thus a change in the value of r from .88 to 
.98 is equivalent, on the r scale, to a change from .20 to 
.30. But the first of these differences represents, on the 
z scale, a change from 1.38 to 2.30 (a range of .92) while 
the second represents a change in z from .20 to .31 (a 
range of .11). The first difference, on the z scale, is over 
8 times more significant than the second. In this the z 
scale gives a far more accurate representation of the true 
significance of observed correlations than does the r scale. 
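
Where a table of r and z is not at hand, the transformation is easily computed. The following Python sketch (function names are illustrative) applies the two relations and reproduces the comparison of the two intervals discussed above.

```python
import math

def r_to_z(r):
    """Fisher's z: one half the difference of the natural logs of (1 + r) and (1 - r)."""
    return 0.5 * (math.log(1.0 + r) - math.log(1.0 - r))

def z_to_r(z):
    """Inverse transformation, from z back to r."""
    e2z = math.exp(2.0 * z)
    return (e2z - 1.0) / (e2z + 1.0)

for low, high in [(0.88, 0.98), (0.20, 0.30)]:
    z_low, z_high = r_to_z(low), r_to_z(high)
    print(f"r {low:.2f} -> {high:.2f}: z {z_low:.2f} -> {z_high:.2f}, "
          f"range {z_high - z_low:.2f}")
```

Equal steps in r thus correspond to very unequal steps in z, the steps widening as r approaches unity.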

More important than this, however, is the fact that the
distribution of z is much closer to the normal type than is
that of r; in particular, the distribution of z is not subject,
as is that of r, to marked variations in form with variations
in the degree of correlation in the population. The form
of the distribution of z is virtually independent of the
degree of correlation. As a result, the sampling errors to
which z is exposed may be estimated with considerable
accuracy. For the standard error of z we have

$$\sigma_z = \frac{1}{\sqrt{N - 3}}.$$

This standard error, it is to be noted, is a function solely
of N. It is independent of the true value of z in the parent
population.

¹ See Statistical Methods for Research Workers, Chapter VI.

From the example in Chapter XVI we obtained a coefficient
of partial correlation of -.2923 between corn yield
per acre in Kansas and average June temperature, holding
constant effects of changes in July and August temperatures.
Referring to Appendix Table IV we have, for
r = -.2923, z = -.301. In computing the standard
error of a coefficient of partial correlation we must subtract
from N the number of variables held constant. Since N
equals 44 in the example in question, we treat the coefficient
of partial correlation as we would a simple coefficient based
on 42 observations. For the standard error of z we have,
then,

$$\sigma_z = \frac{1}{\sqrt{42 - 3}} = .160.$$

With reference to this result we may determine whether z
differs significantly from zero. For the test we must have

$$\frac{z - 0}{\sigma_z} = \frac{-.301}{.160} = -1.88.$$

We interpret 1.88, the magnitude of this ratio, as a normal
deviate. It is clear that it is not large enough to indicate
that z is significant. The result is not inconsistent with the
hypothesis that the true value of z (and hence of r) is zero.

If, however, we test the coefficient $r_{14.23} = -.4057$, from
the same example (defining the relation between corn yield
per acre and August temperature, with June and July temperatures
held constant), we have

$$\frac{z - 0}{\sigma_z} = \frac{-.430}{.160} = -2.69.$$

This result is clearly significant. So, also, is the measure
$r_{13.24} = -.6101$, the coefficient of partial correlation between
corn yield and July temperature, with June and
August temperatures held constant.
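
The tests of the partial coefficients can be sketched in Python as follows. The function name and its `held_constant` parameter are illustrative; the coefficients, N = 44, and the two variables held constant are those of the corn-yield example.

```python
import math

def z_ratio(r, n_obs, held_constant=0):
    """Return (z, standard error of z, z / standard error) for a test against zero."""
    z = math.atanh(r)                    # equals (1/2) log((1 + r) / (1 - r))
    n_effective = n_obs - held_constant  # subtract the variables held constant
    se = 1.0 / math.sqrt(n_effective - 3)
    return z, se, z / se

for label, r in [("June", -0.2923), ("August", -0.4057), ("July", -0.6101)]:
    z, se, ratio = z_ratio(r, 44, held_constant=2)
    print(f"{label:6s}  r = {r:+.4f}  z = {z:+.3f}  se = {se:.3f}  z/se = {ratio:+.2f}")
```

Each ratio is read as a normal deviate, the sign being disregarded when only the magnitude of the departure from zero is in question.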

The procedure would be similar, of course, if we were
testing the significance of the deviation of an observed
value of z from a theoretical value other than zero.

The transformation to z makes possible, also, an accurate
test of the significance of the difference between two
observed correlations. The standard error of the difference
between two values of z is given by

$$\sigma_{D_z} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}},$$

where $N_1$ is the number of pairs of observations in the first
sample, $N_2$ the number in the second.

This test may be illustrated with reference to observations
on the timing of price changes during business cycles.
For 111 commodities we have observations on the timing
of price declines in two successive periods of business
recession occurring in the late 90's and early 1900's. The
degree of relation between the time sequences of commodity
price changes in these two recessions is indicated by a
coefficient of correlation of +.22. For two similar (successive)
periods in the 1920's the measure of correlation,
based on the prices of 121 commodities, has a value of
+.36. There appears to have been a closer approach to a
common pattern in the later period than in the earlier.
In testing the significance of the difference between the two
results we set up the hypothesis that the two samples were
drawn from the same parent population, and that therefore
the true value of the difference between the two coefficients
is zero.

For the two samples we have

$$r_1 = .22; \quad z_1 = .223; \quad \frac{1}{N_1 - 3} = .0093$$

$$r_2 = .36; \quad z_2 = .377; \quad \frac{1}{N_2 - 3} = .0085.$$

The difference to be tested is

$$D_z = .377 - .223 = .154.$$

The standard error of this difference is

$$\sigma_{D_z} = \sqrt{.0093 + .0085} = .133.$$

We wish to know whether $D_z$ is significantly different from
zero. We compute, therefore,

$$\frac{D_z - 0}{\sigma_{D_z}} = \frac{.154 - 0}{.133} = 1.16.$$


Interpreting 1.16 as a normal deviate, we conclude that
the difference is not significant. $D_z$ differs from the hypothetical
value of zero by only slightly more than one standard
deviation. The results are not inconsistent with the
hypothesis that the two samples are drawings from the
same parent population. There is here no clear evidence
that the degree of relationship between price movements
in successive cycles was closer in the 1920's than in the
earlier period.¹
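
The comparison of the two coefficients may be sketched in Python as below; the function name is illustrative. Because the sketch carries z at full precision rather than to three decimals, the ratio comes out near 1.15 instead of the 1.16 obtained from the rounded figures. The conclusion is unaffected.

```python
import math

def z_difference_test(r1, n1, r2, n2):
    """Return (D_z, standard error of D_z, ratio) for two independent correlations."""
    d_z = math.atanh(r2) - math.atanh(r1)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return d_z, se, d_z / se

d_z, se, ratio = z_difference_test(0.22, 111, 0.36, 121)
print(f"D_z = {d_z:.3f}, standard error = {se:.3f}, ratio = {ratio:.2f}")
```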

Finally, making use of the z-transformation, we may
combine results secured from the measurement of correlation
in different samples. If we have two values of r,
obtained from samples drawn from the same population,
a weighted average of the two will provide a better estimate
of the true correlation than will either of the r's, taken
separately. For the averaging process we transform the
r's to z's, weight each z by the corresponding N, less 3,
and average them. Then, if desirable, the corresponding
value of r may be determined. We may note that the
standard error of the weighted average of the two z's is
given by

$$\sigma_{\bar{z}} = \frac{1}{\sqrt{(N_1 - 3) + (N_2 - 3)}}.$$

¹ The time factor enters to cloud statistical inductions relating to samples
drawn from different periods (see above, Chapter XIV). Such an induction
should be supported by evidence indicating that fundamental conditions in the
field in question have not been altered over the time interval involved. This
caution does not, of course, affect the procedure illustrated above.
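
The averaging procedure may be sketched in Python as follows. Purely for illustration, the sketch reuses the two coefficients of the preceding example (.22 from 111 pairs, .36 from 121 pairs) and supposes them drawn from one population; that supposition, and the function name, are assumptions of the sketch and not findings of the text.

```python
import math

def pooled_correlation(samples):
    """samples: list of (r, N) pairs assumed drawn from one population.

    Returns (pooled r, weighted mean z, standard error of the mean z).
    """
    weights = [n - 3 for _, n in samples]       # weight each z by N - 3
    zs = [math.atanh(r) for r, _ in samples]    # transform each r to z
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se = 1.0 / math.sqrt(sum(weights))          # 1 / sqrt(sum of (N - 3))
    return math.tanh(z_bar), z_bar, se

r_bar, z_bar, se = pooled_correlation([(0.22, 111), (0.36, 121)])
print(f"mean z = {z_bar:.3f}, standard error = {se:.3f}, pooled r = {r_bar:.3f}")
```

The pooled r necessarily lies between the two sample values, nearer the one resting on the larger sample.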


The Chi-Square Test 

One of the great contributions of Karl Pearson to statisti- 
cal methodology was the determination of the form of 
the distribution of Chi-square, and the development of 
methods of utilizing this distribution. The character of 
this distribution and various tests based on it are our 
concern in the present section. 

THE NATURE OF CHI-SQUARE AND ITS DISTRIBUTION 

The quantity Chi-square (represented always by the
symbol χ²) is a measure of the degree to which a series of
observed frequencies deviate from corresponding theoretical
or hypothetical frequencies. The theoretical frequencies
are set up on the basis of some hypothesis, some rational
argument. The magnitude of the discrepancy between
theory and observation is defined by the quantity χ². It
was Pearson's contribution to determine the nature of the
distribution of the values of χ² that would be obtained
under given sampling conditions. Knowledge of this distribution
enables us to determine whether a given discrepancy
between theory and observation may be attributed
to chance, or whether it results from the inadequacy of
the theory to fit the observed facts. This instrument is
obviously one of extreme importance in statistical analysis.

The character of the distribution of χ² may be discussed
with reference to Weldon's data relating to the results
obtained in 4,096 throws of 12 dice (see page 433). We
call a 4, 5, or 6 spot a success, a 1, 2, or 3 spot a failure.
When 12 dice are thrown the expected (or theoretical)
number of successes on each throw is 6. A deviation from
6 represents a discrepancy between expectation and observation.
From the result of each throw of 12 dice a value of χ²
may be computed. Thus, a given throw yields 2 successes
and 10 failures. The 2 successes represent a deviation of 4
from the expected value of 6; the 10 failures represent a
deviation of 4 from the expected value of 6. (In such an
experiment as this there are two components of each value
of χ², even though when one component is given the other
is necessarily determined. For the sum of successes and
failures must be 12 on each throw.) The value of χ² in
a given instance is obtained by squaring the discrepancies
between expectation and observation, dividing the squared
values by the corresponding expected values, and adding
the quantities thus obtained. That is

$$\chi^2 = \Sigma\left[\frac{(f_o - f)^2}{f}\right],$$

where $f_o$ denotes an observed frequency and $f$ defines the
corresponding theoretical frequency.

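In the case cited above we have

$$\chi^2 = \frac{(2 - 6)^2}{6} + \frac{(10 - 6)^2}{6} = 5.333.$$

On another trial, with 7 successes and 5 failures, we have

$$\chi^2 = \frac{(7 - 6)^2}{6} + \frac{(5 - 6)^2}{6} = .333.$$

On still another trial, giving 6 successes and 6 failures, we
have

$$\chi^2 = \frac{(6 - 6)^2}{6} + \frac{(6 - 6)^2}{6} = 0.$$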
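
The rule just stated (square each discrepancy, divide by the expected frequency, and add) may be put into a small Python function. The sketch below, with an illustrative function name, repeats the three throws worked above.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - f) ** 2 / f for fo, f in zip(observed, expected))

# Each throw of 12 dice: [successes, failures] against an expectation of [6, 6].
for throw in ([2, 10], [7, 5], [6, 6]):
    print(throw, round(chi_square(throw, [6, 6]), 3))
```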




The 4,096 throws thus yield 4,096 values of χ². Tabulating
these with respect to the frequency of occurrence of stated
values, we obtain the distribution given in Table 140 on
page 620.

Table 140

Tabulation of 4,096 Observed Values of χ²

Value¹ of χ²
(measuring deviation of             Frequency of      Frequency of
observation from expectancy         occurrence        occurrence
in dice-throwing experiment)        (absolute)        (relative)

0 to .833                              2,526             .6167
.833 to 2.167                            966             .2358
2.167 to 4.167                           455             .1111
4.167 to 6.667                           131             .0320
Over 6.667                                18             .0044

Total                                  4,096            1.0000

¹ The 4,096 values of χ² tabulated here constitute a discrete series. The
conditions of the experiment are such that the 4,096 observations on χ² are
distributed among only six values, ranging from 0 to 8.333. In order that the
observed frequencies of occurrence of stated values of χ² may be compared (in
a later table) with theoretical frequencies, an uneven class-interval is employed
above. Class limits are taken midway between successive values at
which the actual observations fall. (The decimal fractions used in the table do
not define these limits with full accuracy.)

This table gives us information as to the nature of the
discrepancies between theoretical norms and actual results
that chance may bring about. For deviations from the
expected frequency of successes, 6, may be attributed to
the mass of undifferentiated causes we call chance. The
magnitude of χ² varies, of course, with the degree of deviation.
Values of χ² not exceeding .833 are most frequent.
Higher values of χ² occur with decreasing frequency. Only
18 out of 4,096 observed values of χ² exceed 6.667. This
distribution furnishes us, therefore, with a standard of
reference to employ when seeking to determine whether
a given discrepancy between theoretical and observed values
is attributable to chance, or whether it is too great to be
so explained.

This use of the table, as a standard for determining
the probability that given discrepancies between theory
and observation are attributable to the play of chance,
is facilitated by a somewhat different arrangement. We
may set up a table of cumulative values, based upon the
tabulation of the 4,096 values of χ² obtained in the preceding
experiment. These are given in Table 141.

The entries in col. (2) of this table indicate that in the
experiment involving 4,096 throws of dice, a value of χ²
of 6.667 or more occurs less frequently than 1 time out
of 100 (only 44 times out of 10,000, in fact). A value as
great as 4.167, however, occurred more frequently than
3 times out of 100. If we interpret these relative frequencies
as probabilities, we may obtain from such a table a knowledge
of the probabilities corresponding to stated values
of χ². Here is the instrument we desire, in seeking to determine
whether given observations conform closely enough
with expectations based on theory, or on working hypotheses
which perhaps are not yet ready to be dignified as theories.


Table 141

Cumulative Relative Frequencies of Occurrence of 4,096 Observed
Values of χ², with Corresponding Theoretical Frequencies¹

        (1)                         (2)                    (3)
   Value of χ²               Relative frequency     Relative frequency
(measuring deviation           of occurrence          of occurrence
   of observation               (observed)            (theoretical)
  from expectancy)

   0 or more                     1.0000                 1.0000
   .833 or more                   .3833                  .3613
   2.167 or more                  .1476                  .1411
   4.167 or more                  .0364                  .0412
   6.667 or more                  .0044                  .0098

We should note two important limitations attaching to
the entries in col. (2) of the above table, showing relative
frequencies corresponding to stated values of χ². In the
first place, these are merely empirical results, obtained
from a given set of experiments. The conditions of the
experiment yield a discontinuous series of values for χ².
In some degree, this discontinuity has been ironed out by
the method of classification employed, but the instrument
derived from this single experiment remains an imperfect
one. The effects of chance fluctuations are present in
these results, also, and contribute to the imperfection of
the instrument. The true distribution of χ² is only approximated
by the results presented in col. (2) of Table 141.

The entries in col. (3) of Table 141 are free of this limitation.
These record the frequencies with which values
of χ² falling within the limits indicated in col. (1) might
be expected to occur, on the basis of mathematical theory,
under the conditions of the present experiment.² These are
the entries which provide the standard we desire, in determining
the significance of a given series of discrepancies
between observation and expectation. It is to be noted,
however, that the empirically derived table constitutes a
fair approximation to the theoretical distribution of χ²
under these conditions.

¹ One degree of freedom is present in the determination of a single value of
χ² in this example.

² These relative frequencies are taken from G. Udny Yule, "Table of the
values of P for divergence from independence in the fourfold table," Journal
of the Royal Statistical Society, Vol. LXXXV, January, 1922, 103-104.

The second limitation attaching to the example cited
above is that each of the 4,096 values of χ² tabulated has
two components, and that the experiment is such that
when one component is given the second is necessarily
determined. (Since there are 12 events in each throw we
know, for example, that if we have 8 successes there must
be 4 failures.) This condition is described by saying that
there is but one degree of freedom in the derivation of
a given value of χ². The table we have obtained relates,
therefore, to a special case: the distribution of values
of χ² computed with one degree of freedom. There are
other possible cases. For each of these the distribution of χ²
may be determined in a manner similar to that shown above.

As an example of a different set of conditions we may
consider the outcome of a throw of 24 dice, account being
kept of the frequency of occurrence of each possible result
(i.e., the appearance of a 1, 2, 3, 4, 5, or 6 spot). When 24
dice are thrown there may be expected 4 one spots, 4
two spots, 4 three spots, etc. In a given throw we obtain
the following results:

Number of spots         1    2    3    4    5    6

Observed frequency      2    5    6    4    4    3
Expected frequency      4    4    4    4    4    4

For the results of this throw the value of Chi-square would
be given by

$$\chi^2 = \frac{(2-4)^2}{4} + \frac{(5-4)^2}{4} + \frac{(6-4)^2}{4} + \frac{(4-4)^2}{4} + \frac{(4-4)^2}{4} + \frac{(3-4)^2}{4} = 2.50.$$


This quantity has six components. However, as soon as
five are given the sixth is determined, since the total number
of events is fixed at twenty-four. There are, then, five
degrees of freedom in the calculation of χ² in this experiment.

If the 24 dice were thrown a thousand times, say, we
should have one thousand values of χ². A distribution
of these could be constructed, similar to that derived
empirically for the case in which there was one degree
of freedom. It would be a different distribution, however,
for the change in degrees of freedom has an obvious relation
to the magnitude of χ². The character of the distribution
of the values of χ² that would be obtained in such an experiment
is indicated by the entries in Table 142 on page 624.
We do not here give empirical values, as in the preceding
example. The table shows the theoretical frequencies with
which given values of χ² occur, when five degrees of freedom
prevail.
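
Theoretical frequencies of this kind need not be taken from a printed table. For a whole number of degrees of freedom the probability of a χ² as great as x or greater has a closed form: a finite series multiplied by an exponential when n is even, and the same kind of series added to a normal-tail term when n is odd. The Python sketch below uses these standard series; the function name is illustrative, and nothing in it is taken from the tables cited in this section.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-square >= x) for a positive whole number of degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: two-tailed normal area beyond sqrt(x), plus
    # sqrt(2/pi) * exp(-x/2) * sum_{r=1}^{(df-1)/2} chi^(2r-1) / (1*3*...*(2r-1))
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(chi / math.sqrt(2.0))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)

for value in (1, 3, 11, 16):
    print(f"chi-square of {value} or more, 5 d.f.: {chi2_upper_tail(value, 5):.4f}")
```

The same function serves for any other whole number of degrees of freedom, so that a table for one, five, or fourteen degrees of freedom may be built up in the same manner.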

Table 142

Tabulation of χ² Computed with Five Degrees of Freedom, with
Cumulative Relative Frequencies¹

                       Relative frequency
   Value of χ²           of occurrence
                         (theoretical)

    0 or more              1.0000
    1 or more               .9626
    2 or more               .8491
    3 or more               .7000
    4 or more               .5494
    5 or more               .4159
    6 or more               .3062
    7 or more               .2206
    8 or more               .1562
    9 or more               .1091
   10 or more               .0752
   11 or more               .0514
   12 or more               .0348
   13 or more               .0234
   14 or more               .0156
   15 or more               .0104
   16 or more               .0068
   30 or more               .000015
    ∞                       .000000

¹ From the table prepared by W. P. Elderton and given in Tables for Statisticians
and Biometricians, Karl Pearson, editor, 26. The n' of Elderton's table
is equal to n + 1, for an example of the type here given.

In using tables of this sort we may interpret measures
of relative frequency as probabilities. Thus we may
read Table 142, which relates to the distribution of χ²
computed with five degrees of freedom, as follows: If
the true value of χ² is zero (i.e., in an infinitely large sample
observed frequencies would agree precisely with the theoretical
frequencies we have set up), the probability of our
securing a χ² of zero or more, from a sample of the type
here employed, is 1.00; the probability of our securing
a χ² of 1.00 or more is 9,626/10,000; the probability of
our securing a χ² of 3.00 or more is 7/10; the probability
of our securing a χ² infinitely large is 0. The quantities
χ² and P stand, thus, in a definite functional relationship,
for any given value of n (n denotes the number of degrees
of freedom). At the two limits the relationships are the
same for all values of n. When χ² = 0, P = 1.00; when
χ² = ∞, P = 0.00. But for intermediate values the relationship
varies with n.

Table 143¹

Table of χ² for Selected Values of P and n

  n    P = .99      .95       .50       .10       .05       .02       .01

  1    .000157    .00393     .455     2.706     3.841     5.412     6.635
  2    .0201      .103      1.386     4.605     5.991     7.824     9.210
  3    .115       .352      2.366     6.251     7.815     9.837    11.341
  4    .297       .711      3.357     7.779     9.488    11.668    13.277
  5    .554      1.145      4.351     9.236    11.070    13.388    15.086
  6    .872      1.635      5.348    10.645    12.592    15.033    16.812
  7   1.239      2.167      6.346    12.017    14.067    16.622    18.475
  8   1.646      2.733      7.344    13.362    15.507    18.168    20.090
  9   2.088      3.325      8.343    14.684    16.919    19.679    21.666
 10   2.558      3.940      9.342    15.987    18.307    21.161    23.209
 11   3.053      4.575     10.341    17.275    19.675    22.618    24.725
 12   3.571      5.226     11.340    18.549    21.026    24.054    26.217
 13   4.107      5.892     12.340    19.812    22.362    25.472    27.688
 14   4.660      6.571     13.339    21.064    23.685    26.873    29.141
 15   5.229      7.261     14.339    22.307    24.996    28.259    30.578
 16   5.812      7.962     15.338    23.542    26.296    29.633    32.000
 17   6.408      8.672     16.338    24.769    27.587    30.995    33.409
 18   7.015      9.390     17.338    25.989    28.869    32.346    34.805
 19   7.633     10.117     18.338    27.204    30.144    33.687    36.191
 20   8.260     10.851     19.337    28.412    31.410    35.020    37.566
 21   8.897     11.591     20.337    29.615    32.671    36.343    38.932
 22   9.542     12.338     21.337    30.813    33.924    37.659    40.289
 23  10.196     13.091     22.337    32.007    35.172    38.968    41.638
 24  10.856     13.848     23.337    33.196    36.415    40.270    42.980
 25  11.524     14.611     24.337    34.382    37.652    41.566    44.314
 26  12.198     15.379     25.336    35.563    38.885    42.856    45.642
 27  12.879     16.151     26.336    36.741    40.113    44.140    46.963
 28  13.565     16.928     27.336    37.916    41.337    45.419    48.278
 29  14.256     17.708     28.336    39.087    42.557    46.693    49.588
 30  14.953     18.493     29.336    40.256    43.773    47.962    50.892

¹ This table is reproduced here through the courtesy of R. A. Fisher and his
publishers, Oliver and Boyd, of Edinburgh. The entries are taken from
Table III of Statistical Methods for Research Workers.

In 1900 Karl Pearson defined the distribution function
of χ².* The actual application of the χ² test is facilitated by
prepared tables. Selected entries from these tabulations,
for different values of n, are given in Table 143 on page 625
and in Appendix Table V.

For determining the significance of χ² beyond the range
of this table, Fisher has given $\sqrt{2\chi^2} - \sqrt{2n - 1}$ as a value
which may be interpreted as a normal deviate. That is,
the figure derived when stated values of χ² and n are inserted
in the above expression is to be taken as a deviation from
the mean of a normal distribution, expressed in units of
the standard deviation of that distribution. The corresponding
value of P is then derived from a table of areas
under the normal curve.
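
A brief Python sketch of this approximation follows. To have a check on it, the sketch is applied at the edge of the table, to the entry 43.773 for n = 30 and P = .05; the function names are illustrative, and taking P as the area in the upper tail of the normal curve is an assumption of the sketch.

```python
import math

def fisher_normal_deviate(chi_sq, n):
    """Fisher's approximation: sqrt(2 * chi-square) - sqrt(2n - 1)."""
    return math.sqrt(2.0 * chi_sq) - math.sqrt(2.0 * n - 1.0)

def upper_tail_normal(deviate):
    """Area of the unit normal curve lying above the given deviate."""
    return 0.5 * math.erfc(deviate / math.sqrt(2.0))

deviate = fisher_normal_deviate(43.773, 30)
p = upper_tail_normal(deviate)
print(f"normal deviate = {deviate:.3f}, approximate P = {p:.3f}")
```

The deviate obtained is close to 1.645, the value cutting off 5 per cent in one tail of the normal curve, so that the approximation agrees fairly well with the tabled entry even at n = 30.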

The χ² test is applicable to a considerable variety of
problems. Wherever, on rational grounds, a set of theoretical
frequencies may be derived, for comparison with observed
frequencies, this test is appropriate in judging of the
significance of the discrepancy between the two sets of
frequencies. In the following pages three applications of
this test are exemplified.

THE CHI-SQUARE TEST OF GOODNESS OF FIT

When an ideal frequency curve, whether normal or of
some other type, is fitted to an actual frequency distribution,
theory and observation are being compared. A test of the
concordance of the two (i.e., of goodness of fit) may be
made by inspection, but such a test is obviously inadequate.
Precision may be secured by employing the χ² test. The
example in Table 144, relating to the distribution of telephone
subscribers discussed in Chapter XIII, illustrates the
procedure.

There are 15 classes in this distribution. Since the total 

*Cf. “ Qa the Criterion that a Given Syntem of Deviations InniL the Probable 
in the Case of a Correlated System of Variables is such that it can be Beason- 
ably Supposed to have Arisen from Random Sampling.” PkUoaophicai Maga- 
zine, 5th Series, Vol. L, 1900. 



GOODNESS OF FIT 


627 


Table 144 

Comjmtation of yf for Testing Goodness of FU 
Normal Curve of Error Fitted to Disiribfuiion of Telephone Subscribers 


(1) 

(2) 

(3) 

(4) 

(5) 

Class 

limits 

Observed 

frequency 

/o 

Theoretical 

frequency 

f 

(/o-y) 

(fo-f)* 

f 

150 and less 

10 

13 14 

- 3.14 

.75 

150-200 

19 

16 76 

-f- 2.24 

.30 

200-250 

38 

31 57 

-H 6 43 

1.31 

250-300 

50 

53.02 

- 3 02 

.17 

300-350 

95 

79.43 

+ 15 57 

3.05 

350-400 

85 

106 10 

-21.10 

4.20 

400-450 

115 

126 41 

- 11.41 

1.03 

450-500 

132 

134.31 

- 2.31 

.04 

500-550 

144 

125.50 

+ 18.50 

2.73 

650-600 

116 

106 51 

-1- 9.49 

.86 

600-650 

79 

81.85 

- 2.85 

.10 

650-700 

54 

55 21 

- 1.21 

.03 

700-750 

31 

33.19 

- 2.19 

.14 

750-800 

11 

17 81 

- 6 81 

2.60 

More than 800 

16 

14 19 

-f- 1 81 

.23 


995 

995 00 

15 groups 

X* = 17.53 


theoretical frequencies must equal the total observed fre- 
quencies, the entry in the fifteenth class is fixed when the 
other 14 are established. The given value of $\chi^2$, 17.53,
is determined, therefore, with 14 degrees of freedom. From
Table 143 we see that when n = 14 a value of $\chi^2$ as great
as 23.686, or greater, would occur purely as a result of
chance in 5 out of 100 random samples, if the true value
of $\chi^2$ were zero. The value of 17.53 secured above is not
excessively high, therefore. The discrepancies between the
observed and theoretical frequencies in Table 144 could
easily have arisen as a result of chance. The fit obtained
with the normal curve is acceptable. Which is to say that
our results are not inconsistent with the hypothesis that
the normal law of error defines the distribution of residence 
telephone subscribers, classified on the basis of message use. 
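
The computation of Table 144 can be retraced in a few lines. The sketch below is in Python, a choice made for this illustration and not part of the original text; the function name `chi_square` is illustrative, and the 5 per cent point is the figure quoted in the text rather than one computed here.

```python
# Observed and theoretical frequencies from Table 144 (15 classes).
observed = [10, 19, 38, 50, 95, 85, 115, 132, 144, 116, 79, 54, 31, 11, 16]
theoretical = [13.14, 16.76, 31.57, 53.02, 79.43, 106.10, 126.41, 134.31,
               125.50, 106.51, 81.85, 55.21, 33.19, 17.81, 14.19]


def chi_square(f_obs, f_theo):
    """Sum of (f_o - f)^2 / f over all classes."""
    return sum((fo - f) ** 2 / f for fo, f in zip(f_obs, f_theo))


chi2 = chi_square(observed, theoretical)

# The totals must agree, so one class frequency is fixed by the other 14.
degrees_of_freedom = len(observed) - 1

# 5 per cent point for 14 degrees of freedom, as quoted in the text.
critical_5_per_cent = 23.686

print(round(chi2, 2), degrees_of_freedom, chi2 < critical_5_per_cent)
```

The text counts 14 degrees of freedom. Later practice deducts one further degree of freedom for each constant of the fitted curve that was computed from the same data (for a normal curve, usually the mean and the standard deviation), which would give 12. The 5 per cent point for 12 degrees of freedom, 21.026, still exceeds 17.53, so the conclusion is unchanged on either count.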

In applying the Chi-square test it is not necessary to
determine the exact probability corresponding to a stated
value of $\chi^2$. Our purpose, in general, is to ascertain whether
observed results are or are not consistent with the hypothesis
on which the fitting procedure is based. For this purpose
we wish only to know whether the value of P corresponding
to a given value of $\chi^2$ falls below (or, much more rarely,
above) certain critical values. As a conventional limit .05
is usually employed. If a value of $\chi^2$ is such that P is below
.05, the discrepancies between observed and theoretical
values are, on this standard, considered too great to be
attributed to chance. The hypothesis on the basis of which
the theoretical frequencies have been determined is suspect,
in such a case. If $\chi^2$ is large enough to give values of P
below .02 or .01, the inadequacy of the hypothesis is,
of course, more strongly indicated.

R. A. Fisher points out that suspicion should attach
to very low values of $\chi^2$, which give values of P of .99
or thereabout. These values indicate a very close agreement
between the hypothesis and the observed facts. Such close
agreement may be due to chance, but there is strong probability
that the hypothesis is at fault or, in mathematical
terms, that the wrong function is being used. Coincidence
of observed and theoretical values suggests the kind of
agreement one obtains by fitting to n points a curve in
the equation to which there are n constants. Any artificial
forcing of agreement between hypothesis and observation
of course invalidates the application of the Chi-square test.

In applying the Chi-square test it is convenient to use 
the conventional standards we have noted, as guides to 
the rejection or provisional acceptance of working hypothe- 
ses. It is unwise to use these standards arbitrarily, however. 
No single standard possesses significance in any absolute 
sense. The investigator in a given field of research will 
interpret the information such a test yields in the light of 
other knowledge relating to that field of experience, and with
due regard to the rational foundation of the hypotheses
being tested. 

One feature of Table 144 requires explanation. It will 
be noted that in the construction of this table the three 
classes at the lower end of the distribution have been lumped 
into one, and that the same thing has been done with the 
six classes at the upper end of the distribution (Cf. Tables 
109 and 144). This is done to avoid the undue magnification 
of slight differences between the tails of the observed and 
theoretical distributions. When $f$, the theoretical frequency,
is very small, a relatively slight absolute discrepancy between
$f_o$ and $f$ may serve to swell materially the value of $\chi^2$. The
lumping process is designed to prevent such a distortion.
Since the selection of classes for combination rests on the
personal judgment of the investigator, a subjective element is
necessarily introduced here. However, the results of the
test will not usually be much affected by minor variations
in the combination of tail-end classes.¹

The use of $\chi^2$ in testing the fit of theoretical frequency
curves is subject to another rather important limitation.
In the computation of $\chi^2$ no account is taken of the distribution
of discrepancies between $f_o$ and $f$. Yet the manner in
which these discrepancies are distributed may materially
influence our judgment as to the goodness of a given fit.
In such an example as that given in Table 144, the successive
values of $f_o - f$, counting from the lower limit of the x-scale,
might be alternately positive and negative. Something
approaching this alternation would be expected if chance
factors alone accounted for the differences between observed
and theoretical frequencies. But the differences might be
distributed otherwise. All the values of $f_o - f$ below the

¹ Considerations of the same sort suggest that a sample of reasonable size is
needed for the valid application of the Chi-square test in curve fitting. Deming
and Birge set 500 observations as the minimum required in a test of this type,
if confidence is to be placed in the result. Yule and Kendall suggest a smaller
number, but place emphasis on the need of an adequate number of theoretical
observations (preferably not less than 10) in every class.
mode might be positive, while all the values above the
mode might be negative. The cumulated discrepancies,
as measured by $\chi^2$, might be equal in the two cases, yet
far more confidence would attach to a fit marked by alternations
of plus and minus deviations than to one in which
a series of positive deviations were bunched together on
the scale, and negative discrepancies were correspondingly
clustered. This limitation serves as a warning against
purely mechanical use of the $\chi^2$ test. Examination of the
fit, and interpretation of $\chi^2$ in the light of the actual distribution
of discrepancies, are required in the application of
this test.
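
One simple way to examine the arrangement of the discrepancies is to count the runs of like signs among the deviations. The text prescribes no particular method, so the sketch below is only an illustration in Python; the helper `count_sign_runs` and the rearranged series `bunched` are hypothetical. The deviations are those of column (4) of Table 144.

```python
# Deviations f_o - f from column (4) of Table 144, in class order.
deviations = [-3.14, 2.24, 6.43, -3.02, 15.57, -21.10, -11.41, -2.31,
              18.50, 9.49, -2.85, -1.21, -2.19, -6.81, 1.81]


def count_sign_runs(values):
    """Number of maximal stretches of consecutive values with the same sign."""
    signs = [v > 0 for v in values if v != 0]
    if not signs:
        return 0
    return 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)


runs_observed = count_sign_runs(deviations)

# Hypothetical contrast: each class keeps the magnitude of its deviation, so
# every term (f_o - f)^2 / f and hence chi-square is unchanged, but the signs
# are bunched (positive in the first six classes, negative in the rest).
# These deviations no longer sum to zero; the series is purely illustrative.
bunched = [abs(d) if i < 6 else -abs(d) for i, d in enumerate(deviations)]
runs_bunched = count_sign_runs(bunched)

print(runs_observed, runs_bunched)
```

The two series yield the same value of $\chi^2$, yet the first changes sign repeatedly along the scale while the second changes sign only once, which is the contrast the text warns about.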

THE CHI-SQUARE TEST OF INDEPENDENCE OF PRINCIPLES OF
CLASSIFICATION¹

A question that frequently arises in research has to do 
with the relation between two principles of classification. 
Thus, in studying commodity price movements during
revivals after business depressions, we may divide all commodities
into durable and non-durable classes. We may
again divide them into those the prices of which precede
the general average of commodity prices in the revival,
and those that lag behind the general index. If the quality
of durability has no relation to the timing of price recovery, 
the two principles of classification are independent. How- 
ever, certain considerations relating to the character of 
demand for durable and non-durable goods lead us to 
believe that the durability of a good is related to the 
behavior of its market price during a period of business 
revival. It is possible to apply an objective test to determine 
whether these principles of classification are, in fact, related. 

Observed frequencies are recorded in Table 145.²

¹ For a discussion of tests of independence and homogeneity see Chapter IV
of Statistical Methods for Research Workers, by R. A. Fisher.

² Data from The Behavior of Prices, National Bureau of Economic Research,
New York, 1927, with later additions.

Table 145

Observation

Two-fold Classification of 208 Commodities

                        Number            Number
Commodity group         preceding         lagging behind      Total
                        general index     general index
                        on price rise     on price rise

Durable goods                6                 61               67
Non-durable goods           50                 91              141
Total                       56                152              208


The nature of the durability classification requires no
explanation. The classification relating to the timing of
price changes in business revival is based on the average
behavior of each of the 208 commodities during 13 periods
of business revival occurring between 1890 and 1936. The
process of cross-classification gives four “cells” among
which the 208 commodities are divided in the manner
indicated in the table.

With the observed frequencies that constitute the entries
in these four cells we may compare a set of theoretical
frequencies, derived from the hypothesis that the durability
of economic goods has no relation to the timing of price
advances after business depressions. These expected frequencies
are computed readily from the sub-totals. The
67 durable goods constitute 32.21 per cent of all the commodities,
while the 141 non-durable goods constitute 67.79
per cent of the total. If durability has no relation to the
timing of price advance, after depression, we should expect
the 56 commodities that preceded the general index to
be divided between durable and non-durable goods in this
same proportion. That is, 32.21 per cent of the 56 commodities,
or 18.04, should be durable, while 67.79 per
cent of the 56, or 37.96, should be non-durable. Similarly,
the 152 commodities lagging behind the general index in
price revival should be divided between the durable and
non-durable categories in exactly the same way, 32.21 per
cent in the durable class, 67.79 per cent in the non-durable.
These expected frequencies, which conform to our hypothesis
that the two principles of classification are independent,
are given in Table 146.
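
The expected frequencies may be retraced from the sub-totals alone. The following is a minimal sketch in Python (variable names are illustrative), using the observed cell counts of Table 145; each expected cell is the product of its row and column totals divided by the grand total, which is the same proportional division the text describes.

```python
# Observed frequencies from Table 145.
# Rows: durable, non-durable. Columns: preceding, lagging.
observed = [[6, 61],
            [50, 91]]

row_totals = [sum(row) for row in observed]          # durable, non-durable
col_totals = [sum(col) for col in zip(*observed)]    # preceding, lagging
grand_total = sum(row_totals)

# Under independence each cell is (row total x column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(cell, 2) for cell in row])
```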


Table 146

Expectation

Two-fold Classification of 208 Commodities

                                 Expected frequencies
                        Number            Number
Commodity group         preceding         lagging behind      Total
                        general index     general index
                        on price rise     on price rise

Durable goods              18.04              48.96             67
Non-durable goods          37.96             103.04            141
Total                      56.00             152.00            208

Chi-square is computed from the relation

$$\chi^2 = \Sigma\left[\frac{(f_o - f)^2}{f}\right]$$

in the following manner:

$$\chi^2 = \frac{(6 - 18.04)^2}{18.04} + \frac{(50 - 37.96)^2}{37.96} + \frac{(61 - 48.96)^2}{48.96} + \frac{(91 - 103.04)^2}{103.04} = 16.222.$$


There are four components of Chi-square in this instance, 
but, as may readily be seen by reference to the table of 
expected frequencies, only one degree of freedom enters 
into its computation. The expected frequencies must yield 
the four group totals, 56, 152, 67, and 141. Accordingly,
as soon as we fill one of the four cells set up by the process
of cross-classification, the other three are definitely determined.
Given 18.04, the expected number of durable
goods preceding the general index in price revival, the
entries in the other cells are fixed. Subtraction of 18.04
from 56 and 67 will fill two of them, and the filling of these
determines the fourth. 

For the interpretation of the given value of Chi-square
we turn to Table 143, which is to be read with n, the number 
of degrees of freedom, equal to 1. If durability of economic 
goods has no relation to the timing of their price changes 
in revival, the two principles of classification employed 
above are independent and the true value of Chi-square 
is zero. Are the observed results consistent with this 
hypothesis? The entries in Table 143 indicate that if the 
true value were zero, a value as great as 3.841 would 
occur 5 times out of 100, as a result of chance fluctuations.
A value as great as 6.635 would occur only 1 time out of 
100. The present value of Chi-square, 16.222, represents 
a still smaller probability. The results are not consistent 
with the hypothesis we have set up. The differences between 
the observed and expected frequencies are too great to be 
attributed to the play of chance. Durability, and factors 
of demand and supply related thereto, appear to play a 
definite role in the timing of price advances in business 
revivals. 
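
The whole test of independence can be sketched as follows, in Python with illustrative variable names. The cell counts are those of Table 145, and the two critical values are the figures quoted in the text for one degree of freedom. Because the sketch carries the expected frequencies unrounded, it yields about 16.22, against the 16.222 obtained in the text from expected frequencies rounded to two decimals.

```python
# Observed frequencies from Table 145.
observed = [[6, 61],
            [50, 91]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        f = r * c / n                        # expected frequency of the cell
        chi2 += (observed[i][j] - f) ** 2 / f

# Once the marginal totals are fixed, one cell determines the other three.
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# 5 and 1 per cent points for one degree of freedom, as quoted in the text.
critical = {0.05: 3.841, 0.01: 6.635}
reject = {level: chi2 > value for level, value in critical.items()}

print(round(chi2, 2), dof, reject)
```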

This test, it should be noted, does not define the relation- 
ship between durability of goods and the timing of price 
revival. It leads us to reject the hypothesis that durability 
has no bearing on the sequence of price advances in revival. 
If, on the basis of some other rational hypothesis, we could
obtain a set of expected frequencies representing a definite 
relationship other than one of independence, this hypothesis 
could be tested in the same manner. From the present 
evidence, however, we may only conclude that the proportion 
of durable goods preceding the general price index on revival 
is smaller and the proportion of non-durable goods larger 
than would be expected if durability had no relation to 
the timing of price recovery after a business depression.

THE CHI-SQUARE TEST OF HOMOGENEITY

For each of eight major industrial groups we have records 
showing, for the year 1933, the number of corporations 
reporting net incomes from their operations and the number
reporting no net incomes (i.e., suffering deficits). The 
returns relate to a total of 492,649 corporations. Is this 
total a homogeneous whole, or does the division of corpora- 
tions between those earning net incomes and those suffering 
deficits vary significantly from group to group? The records 
appear in Table 147. 


Table 147

Comparison of Observed and Theoretical Frequencies
(Tabulations based on corporate income tax returns for 1933, by major
industrial groups¹)

Columns:
(1) Group
(2) Total number of returns
(3) Actual number of returns showing no net income ($f_o$)
(4) Theoretical (expected) number of returns showing no net income ($f$)
(5) $f_o - f$
(6) $(f_o - f)^2$
(7) $(f_o - f)^2 / f$

(1)                                     (2)        (3)        (4)        (5)         (6)           (7)

Agriculture and related industries     10,490      7,818      7,150    +   668      446,224       62.4090
Mining and quarrying                   17,147      8,866     11,688    - 2,822    7,963,684      681.3555
Manufacturing                          93,833     62,295     63,958    - 1,663    2,765,569       43.2404
Construction                           18,234     14,112     12,428    + 1,684    2,835,856      228.1828
Transportation and other public
  utilities                            24,302     14,349     16,564    - 2,215    4,906,225      296.1980
Trade                                 137,858     93,621     93,965    -   344      118,336        1.2594
Service                                47,843     35,419     32,610    + 2,809    7,890,481      241.9650
Finance                               142,942     99,314     97,431    + 1,883    3,545,689       36.3918

Total                                 492,649    335,794    335,794                            1,591.0019
Per cent                              100.000     68.161

¹ From Statistics of Income for 1933. U. S. Treasury Department, Washington,
D. C.

The observed frequencies are, of course, the actual returns
given in col. (3) of Table 147. A set of theoretical or
expected frequencies, for comparison with these, may be
set up on the assumption that all corporations in the United
States constituted a homogeneous population as regards
the likelihood of suffering a deficit in 1933. On this assumption
we may say that the probability of failing to
earn a net profit was, for all the elements of this assumed
homogeneous population, 335,794/492,649, or .68161. If this is the
true probability for all elements of the population, we may
determine a theoretical frequency for each industrial group
by applying this ratio to the total number of corporations
in that group. On the assumption made we should find,
in all groups, the same proportionate division between
corporations earning net incomes and those suffering deficits,
except for modifications due to fluctuations of sampling.
The expected frequencies appear in column (4). If the
hypothesis of homogeneity is valid, these are the true
frequencies for the several groups. Differences between
these and the observed frequencies reflect the play of chance
alone.

The calculation of $\chi^2$, measuring the degree of discrepancy
between the observed and theoretical frequencies, is shown
in cols. (5), (6), and (7) of Table 147. The value of $\chi^2$,
computed with 7 degrees of freedom, is 1,591.0019. Since
the 1 per cent value of $\chi^2$, for n = 7, is only 18.475, the
conclusion is clear that the discrepancy is too great to
be attributed to chance. The results are not consistent
with the hypothesis of homogeneity. We are not justified
in assuming that the forces affecting the profitability of
corporate operations in 1933 were the same, among the
eight major industrial groups here represented.
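
The computation of Table 147 may be sketched in Python as follows (names are illustrative). The figures are columns (2) and (3) of the table; the Construction entry of 14,112 is the value implied by the column total of 335,794 and by the tabulated deviation of 1,684. Expected frequencies are carried unrounded here, so the result differs slightly from the 1,591.0019 of the table, which rests on expected frequencies rounded to whole returns.

```python
# Columns (2) and (3) of Table 147: total returns, and returns showing
# no net income, for each industrial group.
groups = {
    "Agriculture and related industries": (10490, 7818),
    "Mining and quarrying": (17147, 8866),
    "Manufacturing": (93833, 62295),
    "Construction": (18234, 14112),   # implied by the column total
    "Transportation and other public utilities": (24302, 14349),
    "Trade": (137858, 93621),
    "Service": (47843, 35419),
    "Finance": (142942, 99314),
}

total_returns = sum(n for n, _ in groups.values())
total_deficits = sum(d for _, d in groups.values())

# Common probability of a deficit under the hypothesis of homogeneity.
p_deficit = total_deficits / total_returns

chi2 = 0.0
for name, (n, deficits) in groups.items():
    expected = n * p_deficit
    chi2 += (deficits - expected) ** 2 / expected

dof = len(groups) - 1

print(round(p_deficit, 5), round(chi2, 1), dof)
```

The sketch follows Table 147 in summing over the deficit column only. A full contingency-table computation would add the corresponding terms for the corporations earning net incomes; these can only enlarge the total, so the conclusion stands either way.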

The various procedures discussed in this chapter give 
some indication of the variety and power of the methods 
available for use in interpreting and appraising the results 
of statistical research. Each one involves, in some form, 
the testing of hypotheses against evidence yielded by the 
study of samples. It should be emphasized that the formal 
procedures described in the preceding pages are employed at 
a rather late stage in actual research work. The experiment 
will have been planned, the field work done, hypotheses 
framed before the tests here discussed can be applied.
These various steps must, of course, be coordinated. The 
data must be gathered with reference to the hypotheses 
to be tested and to the analytical methods to be employed. 
Acquaintance with appropriate techniques is one prerequisite
of intelligent planning of research in which quantitative
data are utilized. Familiarity with the characteristics
and limitations of the available materials, and clear definition 
of the questions at issue, are equally important elements. 

THE METHOD OF LEAST SQUARES AS APPLIED
TO CERTAIN STATISTICAL PROBLEMS 

The method of least squares in the case of a single 
unknown quantity is merely a procedure for obtaining the 
most probable value of that quantity from a number of 
separate observations. The most probable value is that 
for which the sum of the squares of the deviations (or 
residuals) is a minimum. This is the arithmetic mean of
the observations. 
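
That the arithmetic mean renders the sum of squared deviations a minimum can be checked numerically. The sketch below is in Python; the five observations are hypothetical, chosen only for illustration.

```python
observations = [3, 4, 6, 5, 10]   # hypothetical measurements of one quantity


def sum_of_squares(center, values):
    """Sum of squared deviations of the values from a trial center."""
    return sum((v - center) ** 2 for v in values)


mean = sum(observations) / len(observations)
at_mean = sum_of_squares(mean, observations)

# Trial values on either side of the mean, in steps of one tenth.
trial_values = [mean + step / 10 for step in range(-20, 21) if step != 0]
smallest_elsewhere = min(sum_of_squares(t, observations) for t in trial_values)

print(mean, round(at_mean, 2), round(smallest_elsewhere, 2))
```

Every trial value other than the mean gives a larger sum of squares than the mean itself.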

Where the measurements or observations do not relate 
directly to a single unknown quantity, but to functions of a 
number of unknown quantities, the problem is somewhat
different. In the first case mentioned each observation is
in the form of a single magnitude. In the present case
each observation is in the form of an observation equation
in which the observed values of the variables, as found in
combination, are entered. The unknown quantities are 
the constants which define the functional relationship 
between the variables in question. Our problem is that 
of finding the most probable values of these constants, the 
true values being unknown. 

As in the simpler case the most probable values are 
those for which the sum of the squares of the residuals 
is a minimum. In this case, however, the residuals are 
deviations, not from a single magnitude, as in the case 
of the arithmetic mean, but from the curve which describes 
the most probable functional relationship. The residuals 
are the differences between the computed and the actual 
values of the dependent variable. 


DERIVATION OF THE NORMAL EQUATIONS

Representing by $Y$ an observed value of the dependent
variable, by $Y_c$ the corresponding computed value, by $v$ the
residual, or difference between $Y$ and $Y_c$, and by $W_1$, $W_2$,
$W_3$, and $W_4$ different independent variables (or different
functions of a single independent variable), we may write

$$Y_c = f(W_1, W_2, W_3, W_4)$$

$$v = Y_c - Y = f(W_1, W_2, W_3, W_4) - Y$$

$$\Sigma(v^2) = \Sigma[f(W_1, W_2, W_3, W_4) - Y]^2.$$

If the function in a particular case is of the type

$$Y_c = aW_1 + bW_2 + cW_3 + dW_4$$

we have

$$\Sigma(v^2) = \Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2.$$

Our problem is that of determining the most probable 
values of the constants that define the function. These 
constants are represented, in the present case, by a, b, c,
and d. (The W's, it should be noted, refer to quantities
which are known, once the observation equations are given.
In the usual case the W's are different functions of a single
variable, but this is not essential.) On the assumption
that the errors of observation are distributed in accordance
with the normal law of error, it may be demonstrated
that the most probable values of a, b, c, and d, in the above
equation, are those which render $\Sigma(v^2)$ a minimum; i.e.,

$$\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = \text{a minimum}. \qquad (a)$$

The normal equations necessary for the solution may be
obtained by equating to zero the partial derivatives of
the above expression with respect to the unknowns, a, b,
c, and d. That is, we first differentiate the above function
with respect to a, holding b, c, and d constant, then with
respect to b, holding a, c, and d constant, then with respect
to c, holding a, b, and d constant, then with respect to d,
holding a, b, and c constant. Carrying through this operation
with respect to a, we have

$$\frac{\partial}{\partial a}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{I} \qquad \Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

Differentiating equation (a) now with respect to b, we have

$$\frac{\partial}{\partial b}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{II} \qquad \Sigma W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

Differentiating equation (a) with respect to c,

$$\frac{\partial}{\partial c}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{III} \qquad \Sigma W_3[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

Differentiating equation (a) with respect to d,

$$\frac{\partial}{\partial d}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{IV} \qquad \Sigma W_4[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

The most probable values of the quantities a, b, c, and 
d are secured by solving simultaneously the four normal 
equations thus obtained (numbered above I, II, III, IV). 
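
The passage from each derivative to its normal equation is an application of the chain rule. As a sketch for the first unknown (the intermediate factor of 2 is not written out in the text):

```latex
\frac{\partial}{\partial a}\,\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2
  = 2\,\Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0
```

Dividing by 2 gives equation I; the same step with $W_2$, $W_3$, and $W_4$ in place of $W_1$ gives II, III, and IV.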

FORMATION OF THE NORMAL EQUATIONS

When the observation equations are all of the first degree 
(i.e., of the first degree with respect to the unknown quantities,
a, b, c, etc.) the normal equations may be secured
by the following process:

1. Write the equation which describes the assumed relationship.
The observation equations are derived by substituting in this 
equation the observed values of the variables, as found in com- 
bination. 

2. Multiply each observation equation by the coefficient of the 
first unknown in that equation; the sum of the resulting equations 
constitutes the first normal equation. 

3. Multiply each observation equation by the coefficient of the
second unknown in that equation; the sum of the resulting equations
constitutes the second normal equation.

Continue this process until normal equations equal in 
number to the unknown quantities are obtained. 

The actual process of forming the normal equations in 
curve fitting may be simplified, and the writing out of the 
separate observation equations avoided, as was demonstrated 
in earlier sections. The following may be laid down as 
general rules for the formation of the desired normal 
equations: 

1. Write the equation of the curve to be fitted. For the purpose
of this explanation we may employ the general form

$$Y = aW_1 + bW_2 + cW_3 + dW_4 + \ldots \qquad (1)$$

where $Y$ represents the dependent variable, a, b, c, d, . . . represent
the constants in the equation (the unknown quantities in the
present instance) and $W_1$, $W_2$, $W_3$, $W_4$, . . . represent the coefficients
of these unknowns. It is assumed that these coefficients
represent variables, and that term is used with reference to them.
Call this equation (1).

2. Multiply each term in equation (1) by the coefficient of the
first unknown in (1) (i.e., by $W_1$) and place the summation sign,
$\Sigma$, before each variable. This is the first normal equation (I).

3. Multiply each term in equation (1) by the coefficient of the
second unknown (i.e., by $W_2$) and place the summation sign before
each variable. This is the second normal equation (II).

4. Multiply each term in equation (1) by the coefficient of the
third unknown (i.e., by $W_3$) and place the summation sign before
each variable. This is the third normal equation (III).

5. Multiply each term in equation (1) by the coefficient of the
fourth unknown (i.e., by $W_4$) and place the summation sign before
each variable. This is the fourth normal equation (IV).

The process may be continued until normal equations
equal in number to the unknown quantities are obtained.¹
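
The rules above amount to forming sums of products of the coefficients with one another and with $Y$. A minimal sketch in Python follows; the function `normal_equations` is an illustrative name, not one used in the text. For data it takes the nine points employed later in this appendix to illustrate the check on the normal equations, fitted with a straight line, so that $W_1 = 1$ and $W_2 = X$.

```python
# Nine points (X, Y) from the later example on checking the normal equations.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]

# Coefficients of the unknowns a and b in Y = a*W1 + b*W2.
w_columns = [[1] * len(xs), xs]


def normal_equations(w_cols, y):
    """Return (matrix, rhs) where matrix[i][j] = S(W_i W_j), rhs[i] = S(W_i Y)."""
    matrix = [[sum(wi * wj for wi, wj in zip(col_i, col_j)) for col_j in w_cols]
              for col_i in w_cols]
    rhs = [sum(wi * yi for wi, yi in zip(col_i, y)) for col_i in w_cols]
    return matrix, rhs


matrix, rhs = normal_equations(w_columns, ys)

# Equation I:  rhs[0] = a*matrix[0][0] + b*matrix[0][1]
# Equation II: rhs[1] = a*matrix[1][0] + b*matrix[1][1]
print(matrix, rhs)
```

For these points the two normal equations read 70 = 9a + 45b and 418 = 45a + 285b.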

A STANDARD SET OF NORMAL EQUATIONS

As a set of generalized normal equations secured by the 
above process and applying to any equation which can be 
put in the form 

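$$Y = aW_1 + bW_2 + cW_3 + dW_4 + \ldots,$$

we have

$$\text{I} \qquad \Sigma(W_1Y) = a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) + \ldots$$

$$\text{II} \qquad \Sigma(W_2Y) = a\Sigma(W_1W_2) + b\Sigma(W_2^2) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) + \ldots$$

$$\text{III} \qquad \Sigma(W_3Y) = a\Sigma(W_1W_3) + b\Sigma(W_2W_3) + c\Sigma(W_3^2) + d\Sigma(W_3W_4) + \ldots$$

$$\text{IV} \qquad \Sigma(W_4Y) = a\Sigma(W_1W_4) + b\Sigma(W_2W_4) + c\Sigma(W_3W_4) + d\Sigma(W_4^2) + \ldots$$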

By substituting for $W_1$, $W_2$, $W_3$, $W_4$, etc., the particular
functions employed in a given case, these equations may
be readily adapted to any type of curve in the fitting of
which the method of least squares is applicable. Thus in
fitting a curve represented by the equation

$$Y = a + bX + cX^2 + dX^3$$

substitutions in the standard normal equations given above
are based upon the following relations:

$$W_1 = 1, \qquad W_2 = X, \qquad W_3 = X^2, \qquad W_4 = X^3.$$

The changes to be made in the normal equations are
obvious. $\Sigma(W_1Y)$ becomes $\Sigma(Y)$; $\Sigma(W_1^2)$ is equivalent to
$\Sigma(1^2)$, which is equal to $N$, the total number of observations.

¹ These rules represent an adaptation of a similar series formulated by
Raymond Pearl in Medical Biometry and Statistics, 341.

The first normal equation becomes

$$\Sigma(Y) = Na + b\Sigma(X) + c\Sigma(X^2) + d\Sigma(X^3).$$

The other normal equations are modified correspondingly. 
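
The substitution $W_1 = 1$, $W_2 = X$, $W_3 = X^2$, $W_4 = X^3$ may be carried through numerically. The sketch below, in Python, forms the standard normal equations for a polynomial and solves them by elimination; the helpers `solve` and `fit_polynomial` and the data are hypothetical. The seven points are constructed to lie exactly on a known cubic, so that the fitted constants can be verified, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination on fractions."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def fit_polynomial(xs, ys, degree):
    """Form S(W_i W_j) and S(W_i Y) with W_1 = 1, W_2 = X, ... and solve."""
    powers = [[Fraction(x) ** k for x in xs] for k in range(degree + 1)]
    matrix = [[sum(p * q for p, q in zip(wi, wj)) for wj in powers]
              for wi in powers]
    rhs = [sum(p * y for p, y in zip(wi, ys)) for wi in powers]
    return solve(matrix, rhs)


# Hypothetical data lying exactly on Y = 1 + 2X - X^2 + (1/2)X^3.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x - x ** 2 + Fraction(1, 2) * x ** 3 for x in xs]

a, b, c, d = fit_polynomial(xs, ys, 3)
print(a, b, c, d)
```

With real observations the points will not lie on the curve, and the same routine then returns the least-squares constants rather than an exact reproduction.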

In the example just given, the coefficients are all different
functions of a single independent variable, X. It is not, of
course, essential to the method of least squares that this
be so. The coefficients, $W_1$, $W_2$, $W_3$, etc., may represent a
number of independent variables, as in the case of multiple 
correlation. 

The limitations to the method of least squares must be 
borne in mind in making use of it. This method, in its
direct application, is limited to cases in which the equation
to the curve to be fitted is linear in the constants, i.e., the
observation equations must all be linear as regards the
unknown values, a, b, c, etc. (This does not mean, of course,
that the equation to the fitted curve must be linear.) As 
an example of this limitation, we may cite a curve having 
as equation y - which cannot be fitted directly by 
the method of least squares. If the observation equations 
are non-linear they may be reduced to the linear form in 
many instances by the use of logarithms, and the method 
of least squares then employed. 
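
Reduction to linear form by logarithms may be illustrated with a curve of the type $Y = ab^X$, which becomes $\log Y = \log a + X \log b$. This curve is an assumption chosen for the illustration and is not necessarily the example cited in the text; the data are hypothetical and lie exactly on $Y = 3 \cdot 2^X$. It should be noted that the procedure minimizes the squared residuals of $\log Y$, not those of $Y$ itself.

```python
import math

xs = [0, 1, 2, 3, 4]
ys = [3.0, 6.0, 12.0, 24.0, 48.0]   # hypothetical: exactly 3 * 2**x

# Work with L = log10(Y), so that L = A + B*X with A = log a, B = log b.
log_ys = [math.log10(y) for y in ys]

n = len(xs)
sum_x = sum(xs)
sum_xx = sum(x * x for x in xs)
sum_l = sum(log_ys)
sum_xl = sum(x * l for x, l in zip(xs, log_ys))

# Normal equations:  S(L) = N*A + B*S(X);  S(XL) = A*S(X) + B*S(X^2).
B = (n * sum_xl - sum_x * sum_l) / (n * sum_xx - sum_x ** 2)
A = (sum_l - B * sum_x) / n

a, b = 10 ** A, 10 ** B
print(round(a, 6), round(b, 6))
```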

DERIVATION OF THE FORMULA FOR THE STANDARD ERROR
OF ESTIMATE

It has been pointed out in the body of the text that the 
standard error of estimate may be derived as a by-product 
of the method of least squares. A more complete demon- 
stration of this process may be given at this point. 

When the partial derivative of the expression

$$\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = \text{a minimum}$$

is equated to zero, with respect to the first unknown, a,
we have

$$\Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0.$$

Since

$$aW_1 + bW_2 + cW_3 + dW_4 - Y = v,$$

we have as a necessary condition of fitting

$$\Sigma(vW_1) = 0.$$

When the partial derivative of the same expression with
respect to b is equated to zero, we have

$$\Sigma W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

or, making the same substitution as in the preceding case,

$$\Sigma(vW_2) = 0.$$

Repeating the operation with respect to c and d, we may
show that

$$\Sigma(vW_3) = 0$$

and

$$\Sigma(vW_4) = 0.$$

In summary: When the method of least squares is
employed in determining the most probable values of certain
unknown quantities, having as known coefficients the
quantities $W_1$, $W_2$, $W_3$, $W_4$, the following relations hold
as a necessary condition of the least squares method:

$$\Sigma(vW_1) = 0$$
$$\Sigma(vW_2) = 0$$
$$\Sigma(vW_3) = 0$$
$$\Sigma(vW_4) = 0.$$
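
These conditions can be verified for a fitted straight line, for which $W_1 = 1$ and $W_2 = X$. The sketch below, in Python with illustrative names, uses the nine points employed later in this appendix and exact fractions, so that the sums vanish exactly rather than approximately.

```python
from fractions import Fraction

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Solution of the two normal equations for Y = a + bX.
b = Fraction(n * sum_xy - sum_x * sum_y, n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Residuals v = Y_c - Y, as defined in the text.
residuals = [a + b * x - y for x, y in zip(xs, ys)]

sum_v_w1 = sum(residuals)                              # S(v W_1), W_1 = 1
sum_v_w2 = sum(v * x for v, x in zip(residuals, xs))   # S(v W_2), W_2 = X

print(a, b, sum_v_w1, sum_v_w2)
```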

A knowledge of these relationships gives us a method of
securing readily the value $\Sigma(v^2)$ and the standard error of
estimate. Assume that, by the method of least squares, we
have determined the constants in an equation of the type

$$Y_c = aW_1 + bW_2 + cW_3 + dW_4.$$

For each residual we have the relation

$$v = aW_1 + bW_2 + cW_3 + dW_4 - Y. \qquad (1)$$

Multiplying throughout by v, and summing, we have

$$\Sigma(v^2) = a\Sigma(vW_1) + b\Sigma(vW_2) + c\Sigma(vW_3) + d\Sigma(vW_4) - \Sigma(Yv). \qquad (2)$$

But

$$\Sigma(vW_1) = 0$$
$$\Sigma(vW_2) = 0$$
$$\Sigma(vW_3) = 0$$
$$\Sigma(vW_4) = 0$$

therefore,

$$\Sigma(v^2) = -\Sigma(Yv). \qquad (3)$$

Multiplying each equation (1) throughout by Y, and
adding, we have

$$\Sigma(Yv) = a\Sigma(W_1Y) + b\Sigma(W_2Y) + c\Sigma(W_3Y) + d\Sigma(W_4Y) - \Sigma(Y^2). \qquad (4)$$

Substituting in (3) the equivalent of $\Sigma(Yv)$, we have

$$\Sigma(v^2) = \Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y). \qquad (5)$$

This gives us a method of obtaining the value $\Sigma(v^2)$
without computing the separate residuals, a method that
is applicable whenever the equation of the curve to be
fitted is of the form, or may be reduced by the use of logarithms,
reciprocals, or other manipulation to the form

$$Y = aW_1 + bW_2 + cW_3 + dW_4.$$

In applying this to a particular case it is necessary only to
replace $W_1$, $W_2$, $W_3$, $W_4$, etc., by the functions that actually
appear as coefficients of the unknown quantities in the
original equation. Thus in fitting a curve the equation to
which is

$$Y = a + bX + cX^2 + dX^3,$$

we find, as noted above, that

$$W_1 = 1, \qquad W_2 = X, \qquad W_3 = X^2, \qquad W_4 = X^3.$$

Making these substitutions in equation (5) above, we have

$$\Sigma(v^2) = \Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y). \qquad (6)$$

The standard error, $S_y$, is derived from the equation*

$$S_y^2 = \frac{\Sigma(d^2)}{N}$$

where d is used to represent a deviation from a fitted curve.
The deviation, d, then, is but another term for the residual
v. Accordingly, as a general expression for the standard
error of Y, with $W_1$, $W_2$, $W_3$, and $W_4$ as independent
variables, we have

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y)}{N}. \qquad (7)$$

As in the previous case, this may be applied to a particular
problem by replacing $W_1$, $W_2$, $W_3$, $W_4$, etc., by the actual
coefficients of the unknown quantities.
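
The agreement between the short method and the direct summation of squared residuals may be checked for a straight line ($W_1 = 1$, $W_2 = X$, with c and d absent). The sketch below is in Python with illustrative names and uses the nine points employed later in this appendix; the standard error is computed with N as divisor.

```python
import math
from fractions import Fraction

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_yy = sum(y * y for y in ys)

# Least-squares constants of Y = a + bX from the two normal equations.
b = Fraction(n * sum_xy - sum_x * sum_y, n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Short method: S(v^2) = S(Y^2) - a*S(Y) - b*S(XY).
shortcut = sum_yy - a * sum_y - b * sum_xy

# Direct method: square and add the separate residuals.
direct = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))

standard_error = math.sqrt(shortcut / n)

print(shortcut, direct, round(standard_error, 4))
```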


DERIVATION OF THE FORMULA FOR THE INDEX OF
CORRELATION

We have adopted as an index of the degree of correlation
between two variables the measure $\rho$ (rho), derived from
the equation

$$\rho^2 = 1 - \frac{S_y^2}{\sigma_y^2} \qquad (8)$$

assuming a single dependent variable, Y, and a single independent
variable, X. With a single dependent variable, Y,
and a number of independent variables, $W_1$, $W_2$, $W_3$, $W_4$,
the expression might be written

$$\rho^2_{y.w_1w_2w_3w_4} = 1 - \frac{S_y^2}{\sigma_y^2}. \qquad (9)$$

* Since our object is to measure the actual “scatter” about the fitted curve,
the formula $S_y^2 = \Sigma d^2 / N$ is used, rather than the formula
$S_y^2 = \Sigma d^2 / (N - N_c)$ (where $N$ represents the number of observations
and $N_c$ the number of constants in the equation to the fitted curve). The second
formula would be used, in accordance with the theory of least squares, if we were
seeking to determine the mean square error of an observation or of an
observational equation.
Corresponding changes would be made in the subscripts for
other changes in the symbols employed. The expression
above is equivalent to

$$\rho^2 = 1 - \frac{\Sigma(d^2)}{\Sigma(y^2)}$$

where y represents a deviation from an origin at the mean
of the Y's. But

$$\frac{\Sigma(y^2)}{N} = \frac{\Sigma(Y^2)}{N} - c_y^2$$

where Y represents the original values of the Y-variable
and $c_y$ represents the difference between the original origin
and the mean of the Y's. (The symbols $c_y$ and $c_y^2$ should
not be confused with c, one of the constants in the equation
to the fitted curve.)

Accordingly, we have

$$\rho^2_{y.w_1w_2w_3w_4} = 1 - \frac{\Sigma(d^2)}{\Sigma(Y^2) - Nc_y^2}. \qquad (10)$$

But we have secured an expression for $\Sigma(v^2)$ [the equivalent
of $\Sigma(d^2)$] which holds in the case of a curve fitted by the
method of least squares. Substituting the equivalent of
$\Sigma(d^2)$ in the above equation, and simplifying, we have,
as a general formula for the index of correlation

$$\rho^2_{y.w_1w_2w_3w_4} = \frac{a\Sigma(W_1Y) + b\Sigma(W_2Y) + c\Sigma(W_3Y) + d\Sigma(W_4Y) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}. \qquad (11)$$

This may be applied to a specific case by replacing $W_1$,
$W_2$, $W_3$, $W_4$, etc., in the above formula by the functions
which appear as coefficients of the unknown quantities in
the original equation. When all these are functions of a
single independent variable, as in the usual case, the index
of correlation would be represented by the symbol $\rho_{yx}$.


CERTAIN SPECIAL CASES 

In the case of multiple correlation, where the symbols
$X_1$, $X_2$, $X_3$, $X_4$, etc., are used to represent all the variables,
whether considered dependent or independent, the symbol
R is employed for the measure of correlation and numerical
subscripts utilized as described in the body of the text.

In the case of a straight line relationship between two
variables, ρ is replaced by the symbol r, which represents
the ordinary coefficient of correlation. As the general
equation for r we have

\[ r^{2} = \frac{a\Sigma(Y) + b\Sigma(XY) - N c_y^{2}}{\Sigma(Y^{2}) - N c_y^{2}}. \]

There are two special cases in which this formula may be
simplified. If the origin be at the mean of the X's, we
have

\[ a = \frac{\Sigma Y}{N} = c_y, \qquad N c_y^{2} = a\Sigma Y, \]

and the formula for r reduces to

\[ r^{2} = \frac{b\Sigma(XY)}{\Sigma(Y^{2}) - N c_y^{2}}. \]

If the origin be at the mean of the Y's (it is not essential
that it be also at the mean of the X's)

\[ \Sigma(y) = 0, \quad \text{and} \quad c_y = 0, \]

and the formula for the coefficient of correlation becomes

\[ r^{2} = \frac{b\Sigma(Xy)}{\Sigma(y^{2})}. \]

In this latter case the general formula for ρ may also be
simplified by the elimination of the terms $a\Sigma(y)$ and $N c_y^{2}$.
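
The agreement between the general formula for r and its two special cases may be confirmed numerically. The following is a minimal sketch in Python and is not part of the original text; it borrows the nine points of the straight-line example used later in this appendix, and the function name `r_squared` is merely illustrative.

```python
from fractions import Fraction as F

# Points (X, Y) borrowed from the straight-line example later in this appendix.
pts = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10), (6, 9), (7, 10), (8, 12), (9, 11)]

def r_squared(points):
    """General formula: [a*S(Y) + b*S(XY) - N*cy^2] / [S(Y^2) - N*cy^2]."""
    n = len(points)
    sx = sum(F(x) for x, _ in points)
    sy = sum(F(y) for _, y in points)
    sxx = sum(F(x) * x for x, _ in points)
    sxy = sum(F(x) * y for x, y in points)
    syy = sum(F(y) * y for _, y in points)
    # Constants of the fitted line, from the two normal equations.
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    cy = sy / n  # distance of the mean of Y from the origin in use
    return (a * sy + b * sxy - n * cy * cy) / (syy - n * cy * cy)

mean_x = F(sum(x for x, _ in pts), len(pts))
mean_y = F(sum(y for _, y in pts), len(pts))

general = r_squared(pts)
origin_at_mean_x = r_squared([(x - mean_x, y) for x, y in pts])
origin_at_mean_y = r_squared([(x, y - mean_y) for x, y in pts])

print(general, origin_at_mean_x, origin_at_mean_y)
```

Exact fractions are used so that the three results may be compared without rounding error; a shift of origin in either variable leaves the value of r² unchanged.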


CHECKS ON THE FORMATION OF THE NORMAL EQUATIONS

There are so many possibilities of arithmetical error in
the formation and solution of a set of normal equations
that checks should be employed wherever possible. A
convenient check on the calculations leading to the normal
equations is afforded by the introduction in each observation
equation of an additional term, s, equal to the sum of all
the known quantities in that equation. Thus, in the following
system of observation equations, formed in fitting a line
to the points 1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12;
9, 11, the values of s are as indicated:

                        s

     3 = a + 1b         5
     4 = a + 2b         7
     6 = a + 3b        10
     5 = a + 4b        10
    10 = a + 5b        16
     9 = a + 6b        16
    10 = a + 7b        18
    12 = a + 8b        21
    11 = a + 9b        21.

(The coefficient of a in each case is 1, and this is added to
the other known quantities.)

In fitting a curve described by the type equation

\[ Y = aW_1 + bW_2 + cW_3 + dW_4 \]

the following relations prevail between s and the other
quantities computed. For each observation equation,

\[ Y + W_1 + W_2 + W_3 + W_4 = s. \]

For the normal equations,

\[
\begin{aligned}
\Sigma(W_1 Y) + \Sigma(W_1^{2}) + \Sigma(W_1 W_2) + \Sigma(W_1 W_3) + \Sigma(W_1 W_4) &= \Sigma(W_1 s) \\
\Sigma(W_2 Y) + \Sigma(W_1 W_2) + \Sigma(W_2^{2}) + \Sigma(W_2 W_3) + \Sigma(W_2 W_4) &= \Sigma(W_2 s) \\
\Sigma(W_3 Y) + \Sigma(W_1 W_3) + \Sigma(W_2 W_3) + \Sigma(W_3^{2}) + \Sigma(W_3 W_4) &= \Sigma(W_3 s) \\
\Sigma(W_4 Y) + \Sigma(W_1 W_4) + \Sigma(W_2 W_4) + \Sigma(W_3 W_4) + \Sigma(W_4^{2}) &= \Sigma(W_4 s)
\end{aligned}
\]

This form is capable of application to any specific problem.
In each case the s-equations are formed in precisely the
same way as the corresponding normal equations.
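
The s-check may be sketched in a few lines of Python. This is an illustration added to the text, not the author's working method; the function name `s_check` and the layout of its argument are assumptions made for the example. It is applied here to the straight line fitted to the nine points listed above, for which the coefficients are W1 = 1 and W2 = X.

```python
# Points (X, Y) of the straight-line example in the text.
points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10), (6, 9), (7, 10), (8, 12), (9, 11)]

def s_check(rows):
    """rows holds (Y, [W1, W2, ...]) for each observation equation.

    Returns the s column and, for each normal equation, the pair
    (sum of the known product sums, sum of W_i * s); the two should agree.
    """
    s = [y + sum(w) for y, w in rows]
    k = len(rows[0][1])
    pairs = []
    for i in range(k):
        lhs = sum(w[i] * y for y, w in rows)
        lhs += sum(w[i] * w[j] for _, w in rows for j in range(k))
        rhs = sum(w[i] * si for (_, w), si in zip(rows, s))
        pairs.append((lhs, rhs))
    return s, pairs

# For the line Y = a + bX the coefficients are W1 = 1 and W2 = X.
s, pairs = s_check([(y, [1, x]) for x, y in points])
print(s)      # the s column of the observation equations
print(pairs)  # each pair should contain two equal totals
```

In hand computation the two totals of each pair are reached by separate additions, so that a disagreement points to an error in one of the columns.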

In applying these checks several additional columns are
needed in the working tables, but the extra trouble is
more than compensated by the opportunity to check the
work at each stage. The application is illustrated in the
following working table, showing the calculations involved
in fitting a second degree curve of the form

\[ Y = a + bX + cX^{2} \]

to the nine points 1, 2; 2, 6; 3, 7; 4, 8; 5, 10; 6, 11; 7, 11;
8, 10; 9, 9.


Table A

Illustrating the Use of Checks on the Formation of Normal Equations

      Y      X    X^2     XY    X^2Y      s      Xs     X^2s

      2      1      1      2       2      5       5        5
      6      2      4     12      24     13      26       52
      7      3      9     21      63     20      60      180
      8      4     16     32     128     29     116      464
     10      5     25     50     250     41     205    1,025
     11      6     36     66     396     54     324    1,944
     11      7     49     77     539     68     476    3,332
     10      8     64     80     640     83     664    5,312
      9      9     81     81     729    100     900    8,100

     74     45    285    421   2,771    413   2,776   20,414

(Columns for X^3 and X^4 are omitted, as the values Σ(X^3) and Σ(X^4) may be
derived from prepared tables.)


Each of the values in the column headed s is secured
from the corresponding observation equation. Thus, from
the first observation equation

\[ 2 = 1a + 1b + 1c, \]

we have 5 as the value of s (2, plus the coefficients of the
three constants). These values of s are secured readily
from the table by adding the figures in the columns headed
Y, X, and X^2, plus 1, the coefficient of the constant term a.

Adding the various columns, the arithmetic work is
verified by the following checks:

\[ \Sigma(Y) + N + \Sigma(X) + \Sigma(X^{2}) = \Sigma(s) \]

    74 + 9 + 45 + 285 = 413

\[ \Sigma(XY) + \Sigma(X) + \Sigma(X^{2}) + \Sigma(X^{3}) = \Sigma(Xs) \]

    421 + 45 + 285 + 2,025 = 2,776

\[ \Sigma(X^{2}Y) + \Sigma(X^{2}) + \Sigma(X^{3}) + \Sigma(X^{4}) = \Sigma(X^{2}s) \]

    2,771 + 285 + 2,025 + 15,333 = 20,414.

Further uses of a check of this kind are explained below,
in discussing the solution of the normal equations.
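
The three checks on Table A may be reproduced mechanically. The sketch below is an added illustration in Python, using only the nine points given in the text; the variable names are arbitrary.

```python
X = list(range(1, 10))
Y = [2, 6, 7, 8, 10, 11, 11, 10, 9]
N = len(X)

# s for each observation equation Y = a + bX + cX^2:
# Y plus the coefficients 1, X and X^2 of the three constants.
s = [y + 1 + x + x**2 for x, y in zip(X, Y)]

# Sums of powers of X: S(X), S(X^2), S(X^3), S(X^4).
power = {k: sum(x**k for x in X) for k in range(1, 5)}

checks = [
    # S(Y) + N + S(X) + S(X^2) = S(s)
    (sum(Y) + N + power[1] + power[2], sum(s)),
    # S(XY) + S(X) + S(X^2) + S(X^3) = S(Xs)
    (sum(x * y for x, y in zip(X, Y)) + power[1] + power[2] + power[3],
     sum(x * si for x, si in zip(X, s))),
    # S(X^2 Y) + S(X^2) + S(X^3) + S(X^4) = S(X^2 s)
    (sum(x**2 * y for x, y in zip(X, Y)) + power[2] + power[3] + power[4],
     sum(x**2 * si for x, si in zip(X, s))),
]
for lhs, rhs in checks:
    print(lhs, rhs, "check" if lhs == rhs else "ERROR")
```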

OTHER TESTS

The possibility of checking the calculations in other ways
has been suggested in the preceding sections. Thus, where
the coefficients of the constants in the equation to the
fitted curve are represented by $W_1$, $W_2$, $W_3$, $W_4$, we know
that

\[
\begin{aligned}
\Sigma(vW_1) &= 0 \\
\Sigma(vW_2) &= 0 \\
\Sigma(vW_3) &= 0 \\
\Sigma(vW_4) &= 0.
\end{aligned}
\]

If a curve of the type

\[ Y = a + bX + cX^{2} + dX^{3} \]

has been fitted, this means that

\[
\begin{aligned}
\Sigma(v) &= 0 \\
\Sigma(vX) &= 0 \\
\Sigma(vX^{2}) &= 0 \\
\Sigma(vX^{3}) &= 0.
\end{aligned}
\]

The accuracy of the work may be tested by checking these
relations.

Finally, we may test the accuracy of the work by computing
the standard error of estimate in two different ways.
We may compute the separate residuals by taking the
difference between computed and actual values of the
dependent variable, and from these values determine $S_y$.
This may be compared with the results secured by applying
the general formula for the standard error, as derived above.
In the fitting of the second degree curve, the data of which
were used to illustrate the method of checking the normal
equations, the equation derived was

\[ Y = -.92860 + 3.52316X - .267316X^{2}. \]

From the residuals separately computed, we have

\[ S_y = .4941. \]

From the formula

\[ S_y^{2} = \frac{\Sigma(Y^{2}) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^{2}Y)}{N} \]

we have

\[ S_y = .4947. \]

This constitutes a final check upon the accuracy of the
calculations.
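
Both tests may be tried on the second degree curve fitted to the nine points of Table A. The following Python sketch is an addition to the text: it solves the three normal equations in exact fractions (the small `solve` routine is a plain elimination written for this example, not the Doolittle arrangement described below), confirms that the residuals satisfy Σ(v) = Σ(vX) = Σ(vX²) = 0, and computes the standard error of estimate in the two ways described.

```python
from fractions import Fraction as F
from math import sqrt

def solve(A, b):
    """Gauss-Jordan elimination in exact fractions (illustrative helper)."""
    n = len(A)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for i in range(n):
        p = next(r for r in range(i, n) if M[r][i] != 0)  # first usable pivot
        M[i], M[p] = M[p], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(n):
            if r != i:
                M[r] = [vr - M[r][i] * vi for vr, vi in zip(M[r], M[i])]
    return [row[n] for row in M]

X = list(range(1, 10))
Y = [2, 6, 7, 8, 10, 11, 11, 10, 9]
N = len(X)

# Normal equations for Y = a + bX + cX^2.
A = [[sum(x ** (i + j) for x in X) for j in range(3)] for i in range(3)]
rhs = [sum(x ** i * y for x, y in zip(X, Y)) for i in range(3)]  # S(Y), S(XY), S(X^2 Y)
a, b, c = solve(A, rhs)

# Residuals and the relations S(v) = S(vX) = S(vX^2) = 0.
v = [y - (a + b * x + c * x * x) for x, y in zip(X, Y)]
orthogonality = [sum(vi * x ** i for vi, x in zip(v, X)) for i in range(3)]

# Standard error of estimate, first from the residuals, then from the formula.
sy_residuals = sqrt(sum(vi * vi for vi in v) / N)
sy_formula = sqrt((sum(y * y for y in Y) - a * rhs[0] - b * rhs[1] - c * rhs[2]) / N)

print(float(a), float(b), float(c))
print(orthogonality, sy_residuals, sy_formula)
```

Carried out without rounding, the two values of the standard error coincide; the small difference between the two printed figures in the text is of the size to be expected from constants carried to a limited number of decimal places.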


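SIMPLIFICATION OF NORMAL EQUATIONS IN A MULTIPLE
CORRELATION PROBLEM *

In the discussion of multiple correlation procedure in
Chapter XVI the normal equations as first derived in the
form

\[
\begin{aligned}
\text{I}\quad & \Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4) \\
\text{II}\quad & \Sigma(X_1X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^{2}) + b_{13.24}\Sigma(X_2X_3) + b_{14.23}\Sigma(X_2X_4) \\
\text{III}\quad & \Sigma(X_1X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2X_3) + b_{13.24}\Sigma(X_3^{2}) + b_{14.23}\Sigma(X_3X_4) \\
\text{IV}\quad & \Sigma(X_1X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2X_4) + b_{13.24}\Sigma(X_3X_4) + b_{14.23}\Sigma(X_4^{2})
\end{aligned}
\]

were reduced in number and modified to facilitate their
solution. Details of the method are here given.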

Letting $A_1$, $A_2$, $A_3$, and $A_4$ represent the arithmetic
means of the several variables, and $x_1$, $x_2$, $x_3$, and $x_4$ represent
deviations from the means, we may replace the variables
$X_1$, $X_2$, $X_3$, and $X_4$ by their equivalents $x_1 + A_1$, $x_2 + A_2$,
$x_3 + A_3$, $x_4 + A_4$. The normal equations now become:

\[
\begin{aligned}
\text{I}\quad & \Sigma(x_1 + A_1) = Na + \Sigma(x_2 + A_2) \cdot b_{12.34} + \Sigma(x_3 + A_3) \cdot b_{13.24} + \Sigma(x_4 + A_4) \cdot b_{14.23} \\
\text{II}\quad & \Sigma[(x_1 + A_1)(x_2 + A_2)] = \Sigma(x_2 + A_2) \cdot a + \Sigma(x_2 + A_2)^{2} \cdot b_{12.34} \\
& \qquad + \Sigma[(x_2 + A_2)(x_3 + A_3)] \cdot b_{13.24} + \Sigma[(x_2 + A_2)(x_4 + A_4)] \cdot b_{14.23} \\
\text{III}\quad & \Sigma[(x_1 + A_1)(x_3 + A_3)] = \Sigma(x_3 + A_3) \cdot a + \Sigma[(x_3 + A_3)(x_2 + A_2)] \cdot b_{12.34} \\
& \qquad + \Sigma(x_3 + A_3)^{2} \cdot b_{13.24} + \Sigma[(x_3 + A_3)(x_4 + A_4)] \cdot b_{14.23} \\
\text{IV}\quad & \Sigma[(x_1 + A_1)(x_4 + A_4)] = \Sigma(x_4 + A_4) \cdot a + \Sigma[(x_4 + A_4)(x_2 + A_2)] \cdot b_{12.34} \\
& \qquad + \Sigma[(x_4 + A_4)(x_3 + A_3)] \cdot b_{13.24} + \Sigma(x_4 + A_4)^{2} \cdot b_{14.23}.
\end{aligned}
\]

* Adapted from H. R. Tolley and M. J. B. Ezekiel, "A Method of Handling
Multiple Correlation Problems," Journal of the American Statistical Association,
Vol. 18, 993-1003.

Since $\Sigma(x_1 + A_1) = \Sigma x_1 + NA_1$, and since $\Sigma x_1 = 0$, $\Sigma(x_1 + A_1)$
and all similar expressions may be replaced by $NA_1$, $NA_2$, etc.

If we expand $\Sigma(x_2 + A_2)^{2}$ to $\Sigma(x_2^{2} + 2A_2x_2 + A_2^{2})$, the
middle term drops out, because $\Sigma x_2 = 0$, and the expression
may be written $\Sigma x_2^{2} + NA_2^{2}$. The sums of all similar
squares may be put in similar form.

The product sum $\Sigma(x_1 + A_1)(x_2 + A_2) = \Sigma(x_1x_2 + A_1x_2
+ A_2x_1 + A_1A_2) = \Sigma x_1x_2 + NA_1A_2$ since $\Sigma x_1 = 0$ and $\Sigma x_2
= 0$. Product sums of the same type may be similarly
modified. The normal equations now take the form:

\[
\begin{aligned}
\text{I}\quad & NA_1 = Na + NA_2 b_{12.34} + NA_3 b_{13.24} + NA_4 b_{14.23} \\
\text{II}\quad & \Sigma(x_1x_2) + NA_1A_2 = NA_2 a + [\Sigma(x_2^{2}) + NA_2^{2}] b_{12.34} \\
& \qquad + [\Sigma(x_2x_3) + NA_2A_3] b_{13.24} + [\Sigma(x_2x_4) + NA_2A_4] b_{14.23} \\
\text{III}\quad & \Sigma(x_1x_3) + NA_1A_3 = NA_3 a + [\Sigma(x_2x_3) + NA_2A_3] b_{12.34} \\
& \qquad + [\Sigma(x_3^{2}) + NA_3^{2}] b_{13.24} + [\Sigma(x_3x_4) + NA_3A_4] b_{14.23} \\
\text{IV}\quad & \Sigma(x_1x_4) + NA_1A_4 = NA_4 a + [\Sigma(x_2x_4) + NA_2A_4] b_{12.34} \\
& \qquad + [\Sigma(x_3x_4) + NA_3A_4] b_{13.24} + [\Sigma(x_4^{2}) + NA_4^{2}] b_{14.23}.
\end{aligned}
\]

If we now divide through by N, and substitute $p_{12}$ for
$\Sigma(x_1x_2)/N$, $\sigma_2^{2}$ for $\Sigma(x_2^{2})/N$, and similar symbols for other mean
products and mean squares, the normal equations become

\[
\begin{aligned}
\text{I}\quad & A_1 = a + A_2 b_{12.34} + A_3 b_{13.24} + A_4 b_{14.23} \\
\text{II}\quad & p_{12} + A_1A_2 = A_2 a + (\sigma_2^{2} + A_2^{2}) b_{12.34} + (p_{23} + A_2A_3) b_{13.24} + (p_{24} + A_2A_4) b_{14.23} \\
\text{III}\quad & p_{13} + A_1A_3 = A_3 a + (p_{23} + A_2A_3) b_{12.34} + (\sigma_3^{2} + A_3^{2}) b_{13.24} + (p_{34} + A_3A_4) b_{14.23} \\
\text{IV}\quad & p_{14} + A_1A_4 = A_4 a + (p_{24} + A_2A_4) b_{12.34} + (p_{34} + A_3A_4) b_{13.24} + (\sigma_4^{2} + A_4^{2}) b_{14.23}.
\end{aligned}
\]

These four simultaneous equations may now be reduced
to three. We multiply equation I, throughout, by $A_2$,
and subtract the result from equation II; we then multiply
equation I by $A_3$, and subtract the result from equation III;
we then multiply equation I by $A_4$, and subtract the result
from equation IV. All the terms containing A's are thus
eliminated and we obtain the three normal equations

\[
\begin{aligned}
p_{12} &= \sigma_2^{2} b_{12.34} + p_{23} b_{13.24} + p_{24} b_{14.23} \\
p_{13} &= p_{23} b_{12.34} + \sigma_3^{2} b_{13.24} + p_{34} b_{14.23} \\
p_{14} &= p_{24} b_{12.34} + p_{34} b_{13.24} + \sigma_4^{2} b_{14.23}.
\end{aligned}
\]

Inserting the observed values of the p's and the σ's, these
are solved for the coefficients b. The value a may then be
obtained by inserting the values of the A's and the b's
in the equation

\[ A_1 = a + A_2 b_{12.34} + A_3 b_{13.24} + A_4 b_{14.23}. \]
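
That the three reduced equations lead to the same constants as the original four may be confirmed on a small example. The Python sketch below is an added illustration; the six observations on each variable are hypothetical, chosen only to keep the arithmetic small, and the helper `solve` is a plain elimination in exact fractions rather than any routine from the text.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination in exact fractions (illustrative helper)."""
    n = len(A)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for i in range(n):
        p = next(r for r in range(i, n) if M[r][i] != 0)
        M[i], M[p] = M[p], M[i]
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(n):
            if r != i:
                M[r] = [vr - M[r][i] * vi for vr, vi in zip(M[r], M[i])]
    return [row[n] for row in M]

# Hypothetical observations (not from the text).
X1 = [3, 5, 6, 9, 10, 12]   # dependent variable
X2 = [1, 2, 3, 4, 5, 6]
X3 = [2, 1, 4, 3, 6, 5]
X4 = [1, 4, 2, 6, 3, 5]
N = len(X1)
independents = [X2, X3, X4]

def S(u, v):
    """Product sum of two columns."""
    return sum(F(p) * q for p, q in zip(u, v))

def mean(u):
    return F(sum(u), len(u))

def deviations(u):
    m = mean(u)
    return [x - m for x in u]

# Four normal equations I-IV in the original variables.
columns = [[1] * N] + independents
a_full, *b_full = solve([[S(u, v) for v in columns] for u in columns],
                        [S(u, X1) for u in columns])

# Three reduced equations in mean products p_ij and mean squares sigma_i^2.
x1 = deviations(X1)
xs = [deviations(c) for c in independents]
b_reduced = solve([[S(u, v) / N for v in xs] for u in xs],
                  [S(u, x1) / N for u in xs])
# The constant a follows from equation I: A1 = a + A2*b12.34 + A3*b13.24 + A4*b14.23.
a_reduced = mean(X1) - sum(b * mean(c) for b, c in zip(b_reduced, independents))

print(a_full, b_full)
print(a_reduced, b_reduced)
```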

SOLUTION OF THE NORMAL EQUATIONS

The task of solving the normal equations is not a difficult
one in most of the cases presented to the economic statistician.
If there are only two or three unknowns the corresponding
number of normal equations may be solved by
simple algebraic methods. Even with three equations,
however, it is advisable to employ a systematic procedure,
and with more than three equations this is imperative.
Such systematic methods of solving the simultaneous equations
which are met with in connection with the method
of least squares have been worked out by Gauss and by
Doolittle. The latter method, which is perhaps the more
convenient for general usage, is demonstrated below.

The coefficients of the unknowns in the normal equations
are always symmetrical with respect to the principal diagonal.
Thus in securing the most probable values of the
constants in the equation

\[ Y = aW_1 + bW_2 + cW_3 + dW_4, \]

we have the four normal equations

\[
\begin{aligned}
a\Sigma(W_1^{2}) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) - \Sigma(W_1Y) &= 0 \\
a\Sigma(W_1W_2) + b\Sigma(W_2^{2}) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) - \Sigma(W_2Y) &= 0 \\
a\Sigma(W_1W_3) + b\Sigma(W_2W_3) + c\Sigma(W_3^{2}) + d\Sigma(W_3W_4) - \Sigma(W_3Y) &= 0 \\
a\Sigma(W_1W_4) + b\Sigma(W_2W_4) + c\Sigma(W_3W_4) + d\Sigma(W_4^{2}) - \Sigma(W_4Y) &= 0.
\end{aligned}
\]

The symmetrical arrangement about the diagonal, when
Y-terms are neglected, is obvious. Starting with any term
on the principal diagonal, we have the same coefficients
directly above as to the left. Thus, above the diagonal
term in which the coefficient $\Sigma(W_3^{2})$ appears, we have the
coefficients $\Sigma(W_2W_3)$ and $\Sigma(W_1W_3)$. The same coefficients
are found to the left of the given diagonal term, and on
the same line. For the purposes of solution, therefore,
the terms to the left of each diagonal entry may be omitted,
and we may put the remaining terms of the normal equations
in the form

\[
\begin{array}{lllll}
a\Sigma(W_1^{2}) & {}+ b\Sigma(W_1W_2) & {}+ c\Sigma(W_1W_3) & {}+ d\Sigma(W_1W_4) & {}- \Sigma(W_1Y) \\
 & {}+ b\Sigma(W_2^{2}) & {}+ c\Sigma(W_2W_3) & {}+ d\Sigma(W_2W_4) & {}- \Sigma(W_2Y) \\
 & & {}+ c\Sigma(W_3^{2}) & {}+ d\Sigma(W_3W_4) & {}- \Sigma(W_3Y) \\
 & & & {}+ d\Sigma(W_4^{2}) & {}- \Sigma(W_4Y).
\end{array}
\]

THE DOOLITTLE METHOD

The Doolittle method may be illustrated with reference
to the following normal equations:

    8.3564a + 2.790b  + 2.932c  + 47.967 = 0
    2.790a  + 6.6645b + 2.063c  + 62.039 = 0
    2.932a  + 2.063b  + 7.7893c + 47.519 = 0.

Putting these, for the purposes of the solution, in the
abbreviated form given above, we have

    8.3564a + 2.790b  + 2.932c  + 47.967
            + 6.6645b + 2.063c  + 62.039
                      + 7.7893c + 47.519.

We wish to solve these for the constants a, b, and c. All
the work of computation, with the necessary checks, is
shown in the following table:

Table B

Solution of Normal Equations by the Doolittle Method

  Line   (1)            (2)          (3)          (4)          (5)           (6)
         Reciprocal     a            b            c            Known term    s

  I                     8.3564       2.790        2.932        47.967        62.0454
  II                                 6.6645       2.063        62.039        73.5565
  III                                             7.7893       47.519        60.3033

  1                     8.3564       2.790        2.932        47.967        62.0454
  2      -.11966876    -1.000000     -.333876     -.350869     -5.740151     -7.424896  check

  3                                  6.6645       2.063        62.039        73.5565
  4                                  -.931514     -.978924    -16.015030    -20.715470
  5                                  5.732986     1.084076     46.023970     52.841030  check
  6      -.17442917                 -1.000000     -.189094     -8.027923     -9.217017  check

  7                                               7.7893       47.519        60.3033
  8                                              -1.028748    -16.830133    -21.769807
  9                                               -.204992     -8.702857     -9.991922
  10                                              6.555560     21.986010     28.541571  check
  11     -.15254227                              -1.000000     -3.353796     -4.353796  check

Back Solution

      c              b              a

  -3.353796      -8.027923      -5.740151
                 + .634183      +2.468592
                 -7.393740      +1.176743
                                -2.094816

    a = -2.094816
    b = -7.393740
    c = -3.353796

Check:

Equation I:

    8.3564a + 2.790b + 2.932c = -47.967.

Substituting the given values,

    8.3564(-2.094816) + 2.790(-7.393740) + 2.932(-3.353796) = -47.966985.











Explanation. The coefficients of the unknown quantities,
a, b, and c, are listed in the designated columns.
The known term in each normal equation is listed in column (5).
(The sign of this known term, it should be noted,
is that which it would have when the entire expression, of
which it is one term, is equated to zero.) Column s is
employed as a check. The value in column s, in each of
the lines I, II, and III, is the algebraic sum of the known
values in the given normal equation. In securing this
sum the coefficients to the left of the diagonal, which have
been omitted from the table as it stands, must be included.

The following is a summary of the procedure in solving
the normal equations:

1. In line (1) write normal equation I.

2. In line (2), column (1), write the reciprocal of the value in
line (1), column (2), with sign changed. (This is the reciprocal of the
coefficient of a.) Multiply each item in line (1) by this reciprocal,
entering the products in the corresponding columns in line (2).
[The algebraic sum of the items in columns (2), (3), (4), and (5) of
line (2) should equal the value in column (6).] This operation has
eliminated the unknown a, by expressing it in terms of b and c.
[The -1 in line (2), column (2), has been included only to facilitate
the checking process. The same is true in lines (6) and (11).]
A heavy line may be drawn across the table below line (2).

3. Write normal equation II in line (3).

4. Multiply by the coefficient of b in line (2) (i.e., -.333876)
the items in columns (3), (4), (5), and (6) in line (1). Enter the
products in the corresponding columns of line (4).

5. Add lines (3) and (4), entering the sums in line (5). [The
algebraic sum of the items in columns (3), (4), and (5) of line
(5) should equal the value in column (6).]

6. In column (1), line (6), enter the reciprocal of the value in
column (3), line (5), reversing the sign. Multiply each term in line
(5) by this reciprocal, entering the products in line (6). [The sum
of the items in columns (3), (4), and (5) of line (6) should equal the
value in column (6).] This operation has eliminated the unknown b,
by expressing it in terms of c. A heavy line may be drawn across
the table below line (6).

7. Write normal equation III in line (7).

8. Multiply by the coefficient of c in line (2) (i.e., -.350869)
the items in columns (4), (5), and (6) of line (1). Enter the products
in the corresponding columns of line (8).

9. Multiply by the coefficient of c in line (6) (i.e., -.189094)
the items in columns (4), (5), and (6) of line (5). Enter the products
in the corresponding columns of line (9).

10. Add lines (7), (8), and (9), entering the sums in line (10).
[The algebraic sum of the items in columns (4) and (5) of line
(10) should equal the value in column (6).]

11. In column (1), line (11), enter the reciprocal of the value in
column (4) of line (10), reversing the sign. Multiply each term in
line (10) by this reciprocal, entering the products in line (11).
[The algebraic sum of the items in columns (4) and (5) of line
(11) should equal the value in column (6).] This operation gives
the value of c, which is found in column (5) of line (11). A heavy
line may be drawn across the table below line (11).

Were there additional unknowns, as d and e, this last
operation would have given c as a function of d and e and
it would be necessary to carry the process still further,
repeating the steps taken above. The next operation would
be to bring down the fourth normal equation, entering it
in line (12). Then the coefficients of d in lines (2), (6), and
(11) would be used to multiply the necessary items in
lines (1), (5), and (10), the products being entered in lines
(13), (14), and (15). The sum of the items in lines (12),
(13), (14), and (15) would be entered in line (16) and
checked by the item in the s column. Multiplying through
by the reciprocal of the coefficient of d in line (16), with
sign reversed, the value of d would be obtained in terms
of e. The value of e would be derived in a similar fashion.

The checks on these various operations have been indicated
in the table. The testing of the results at each step
reduces the possibility of error to a minimum.

The back solution presents no difficulties. We have,
from line (11),

    c = -3.353796,

from line (6)

    b = -.189094c - 8.027923,

from line (2)

    a = -.333876b - .350869c - 5.740151.

[The items in column (6) are inserted merely as checks.
The items -1.000000 which appear in lines (2), (6), and
(11) are inserted to assist in the checking.]

The computations involved in the back solution appear
in the table.

A final check is afforded by inserting the values secured
by this process in one of the normal equations. This check,
as carried out for equation I, is shown below the table.
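
The forward and back solutions described above may be sketched compactly in Python. The code is an added illustration, not the tabular layout itself: the function name `doolittle` is an assumption, the check column s is omitted, and full decimal precision is carried instead of the six or eight places of Table B, so that the results agree with the table only to about the fifth decimal place.

```python
def doolittle(A, k):
    """Solve A x + k = 0 for a symmetric matrix A.

    `reduced` plays the part of lines (1), (5), (10) of the table and
    `divided` the part of lines (2), (6), (11): each reduced line
    multiplied by the reciprocal of its leading coefficient, sign reversed.
    """
    n = len(A)
    reduced, divided = [], []
    for i in range(n):
        # Bring down normal equation i (coefficients and known term).
        row = list(A[i]) + [k[i]]
        # Add the earlier reduced lines, each multiplied by the coefficient
        # of unknown i in the corresponding divided line.
        for r, d in zip(reduced, divided):
            m = d[i]
            row = [v + m * rv for v, rv in zip(row, r)]
        pivot = row[i]
        reduced.append(row)
        divided.append([-v / pivot for v in row])
    # Back solution: the last divided line gives the last unknown directly.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = divided[i][n] + sum(divided[i][j] * x[j] for j in range(i + 1, n))
    return x

# The normal equations of the example in the text.
A = [[8.3564, 2.790, 2.932],
     [2.790, 6.6645, 2.063],
     [2.932, 2.063, 7.7893]]
k = [47.967, 62.039, 47.519]

a, b, c = doolittle(A, k)
print(a, b, c)

# Final check: substitute in equation I.
residual_I = A[0][0] * a + A[0][1] * b + A[0][2] * c + k[0]
print(residual_I)
```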
APPENDIX B


DERIVATION OF FORMULAS FOR MEAN AND
STANDARD DEVIATION OF THE BINOMIAL
DISTRIBUTION ¹


For convenience we put the binomial in the form $(q + p)^{n}$,
where q = probability of a failure, p = probability of a
success, and q + p = 1. Expanding the binomial, we have

\[ (q + p)^{n} = q^{n} + nq^{n-1}p + \frac{n(n-1)}{1 \cdot 2} q^{n-2}p^{2} + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3} q^{n-3}p^{3} + \cdots + p^{n}. \]

The terms of this expansion indicate, in order, the probable
frequencies of no successes, 1 success, 2 successes, 3 successes,
and so on, to n successes. A frequency table of the
familiar type may be constructed from these materials.

The items in col. (2) of Table C constitute the terms of
the binomial expansion. Their sum is thus equal to $(q + p)^{n}$,
which is, by definition, equal to 1. The items in col. (3),
added in order, give

\[ nq^{n-1}p + n(n-1)q^{n-2}p^{2} + \frac{n(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^{3} + \frac{n(n-1)(n-2)(n-3)}{1 \cdot 2 \cdot 3} q^{n-4}p^{4} + \cdots + np^{n}. \]

Since the factors n and p appear in each of these terms, this
reduces to

\[ np\left[ q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^{2} + \frac{(n-1)(n-2)(n-3)}{1 \cdot 2 \cdot 3} q^{n-4}p^{3} + \cdots + p^{n-1} \right]. \]

¹ These derivations are adapted from the proof given by D. C. Jones in
A First Course in Statistics, London, Bell & Sons, 1921, 143-145.

Table C

Derivation of Mean and Standard Deviation of the Binomial Distribution

  (1)          (2)                                   (3)                                  (4)
  Number of    Frequency                             (1) x (2)                            (1)^2 x (2)
  successes

  0            q^n                                   0                                    0
  1            n q^(n-1) p                           n q^(n-1) p                          n q^(n-1) p
  2            [n(n-1)/(1·2)] q^(n-2) p^2            n(n-1) q^(n-2) p^2                   2n(n-1) q^(n-2) p^2
  3            [n(n-1)(n-2)/(1·2·3)] q^(n-3) p^3     [n(n-1)(n-2)/(1·2)] q^(n-3) p^3      [3n(n-1)(n-2)/(1·2)] q^(n-3) p^3
  ...          ...                                   ...                                  ...
  n            p^n                                   n p^n                                n^2 p^n

But the terms within brackets, following np, represent the
expansion of the binomial $(q + p)^{n-1}$. Since q + p = 1,
the sum of these terms is 1. Accordingly the sum of the
items in col. (3) reduces to

\[ np(q + p)^{n-1} = np. \]

For the mean of this distribution we have

\[ M = \frac{np}{\Sigma(f)} = \frac{np}{1} = np. \]


Adding the items in col. (4) in order, we have

\[ nq^{n-1}p + 2n(n-1)q^{n-2}p^{2} + \frac{3n(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^{3} + \frac{4n(n-1)(n-2)(n-3)}{1 \cdot 2 \cdot 3} q^{n-4}p^{4} + \cdots + n^{2}p^{n} \]

\[ = np\left\{ q^{n-1} + 2(n-1)q^{n-2}p + \frac{3(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^{2} + \frac{4(n-1)(n-2)(n-3)}{1 \cdot 2 \cdot 3} q^{n-4}p^{3} + \cdots + np^{n-1} \right\}. \]

The terms within brackets may be broken into two groups,
giving

\[ np\left[ \left\{ q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1 \cdot 2} q^{n-3}p^{2} + \cdots + p^{n-1} \right\} + \left\{ (n-1)q^{n-2}p + (n-1)(n-2)q^{n-3}p^{2} + \frac{(n-1)(n-2)(n-3)}{1 \cdot 2} q^{n-4}p^{3} + \cdots + (n-1)p^{n-1} \right\} \right]. \]

The terms within the first of these two groups constitute
the expansion of the binomial $(q + p)^{n-1}$. These terms
may be replaced by that binomial; the second group of
terms may be simplified, since they contain the common
factors n - 1 and p. These operations give us

\[ np\left[ (q + p)^{n-1} + (n-1)p\left\{ q^{n-2} + (n-2)q^{n-3}p + \frac{(n-2)(n-3)}{1 \cdot 2} q^{n-4}p^{2} + \cdots + p^{n-2} \right\} \right]. \]

The second group of terms, thus simplified, is seen to be
(n - 1)p multiplied by the expansion of the binomial
$(q + p)^{n-2}$. Thus we have, as the sum of the items in
col. (4) of the preceding table,

\[ np\left[ (q + p)^{n-1} + (n-1)p(q + p)^{n-2} \right]. \]

But since q + p = 1, $(q + p)^{n-1} = 1$ and $(q + p)^{n-2} = 1$.
Accordingly, the total of col. (4) becomes

\[ np[1 + p(n-1)]. \]

As a general formula for the standard deviation, in
squared form, we have

\[ \sigma^{2} = \frac{\text{total of col. (4)}}{N} - c^{2} \]

where c is the difference between the mean of the distribution
and the arbitrary origin. In the present instance,
the origin is at 0, or "no successes," and c is equal to the
mean, or np. N is equal to $\Sigma(f)$, or 1, in this case. Thus
the standard deviation of the binomial distribution is given by

\[
\begin{aligned}
\sigma^{2} &= np[1 + p(n-1)] - n^{2}p^{2} \\
&= np[np + (1 - p)] - n^{2}p^{2} \\
&= n^{2}p^{2} + np(1 - p) - n^{2}p^{2} \\
&= np(1 - p) \\
&= npq \\
\sigma &= \sqrt{npq}.
\end{aligned}
\]
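
The totals of the three columns, and the resulting mean and standard deviation, may be checked for any particular binomial. The Python sketch below is an added illustration; the values n = 10 and p = 3/10 are arbitrary choices, and the column layout (number of successes, frequency, and the two product columns) is assumed from the way the columns are used in the derivation.

```python
from fractions import Fraction as F
from math import comb

def binomial_columns(n, p):
    """Rows of (successes m, frequency f, m*f, m^2*f) for the binomial (q + p)^n."""
    q = 1 - p
    rows = []
    for m in range(n + 1):
        f = comb(n, m) * q ** (n - m) * p ** m
        rows.append((m, f, m * f, m * m * f))
    return rows

n, p = 10, F(3, 10)   # arbitrary illustrative values
q = 1 - p
rows = binomial_columns(n, p)

total_f = sum(r[1] for r in rows)     # col. (2): should be 1
total_mf = sum(r[2] for r in rows)    # col. (3): should be np
total_m2f = sum(r[3] for r in rows)   # col. (4): should be np[1 + p(n - 1)]

mean = total_mf / total_f
variance = total_m2f / total_f - mean ** 2   # origin at 0, so c equals the mean

print(total_f, total_mf, total_m2f, mean, variance)
```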
DERIVATION OF THE STANDARD ERROR
OF THE ARITHMETIC MEAN

We have made n random, hence independent, observations
on a given variable. The respective observations may be
represented by $X_1$, $X_2$, $X_3$ . . . $X_n$. Representing the sum of
the n observations by W, we have

\[ W = X_1 + X_2 + X_3 + \cdots + X_n. \tag{1} \]

Additional samples are now taken until we have N values of
$X_1$, N values of $X_2$, etc., and hence N values of the sum, W.
We have N samples, therefore, of n observations each. The
mean values, which we may represent by barred letters,
stand in the same relationship of equality:

\[ \bar{W} = \bar{X}_1 + \bar{X}_2 + \bar{X}_3 + \cdots + \bar{X}_n. \tag{2} \]

Using small letters ($w$, $x_1$, $x_2$, etc.) to define deviations of
the actual observations from these mean values, we may
write, for any given sample, or series of observations,

\[ w = x_1 + x_2 + x_3 + \cdots + x_n. \tag{3} \]

Squaring the two sides of this equation, we have

\[ w^{2} = x_1^{2} + x_2^{2} + x_3^{2} + \cdots + x_n^{2} + 2x_1x_2 + 2x_1x_3 + \cdots + 2x_1x_n + 2x_2x_3 + \cdots + 2x_2x_n + \cdots + 2x_3x_n + \cdots. \tag{4} \]

Each term on the right-hand side of (3) will appear in squared
form in (4), and there will also appear product terms of the
form $2x_1x_2$ corresponding to all possible pairings of the terms
on the right-hand side.
The next step involves the summation of the equations
of type (4), derived from the N samples, and division
throughout by N. Each product term, when thus summed
and divided by N, will be of the form

\[ \frac{2\Sigma x_1x_2}{N}. \]

This, with the modification introduced by the factor 2,
resembles the familiar mean product, encountered in correlation
procedure. This mean product, we have seen, has a
value of zero when the variables x and y are uncorrelated.
But, by hypothesis, the observations that have given us
$x_1$, $x_2$, $x_3$, etc., are independent of one another, and hence
these variables are uncorrelated. Accordingly, each of the
product terms, derived when N equations corresponding
to (4) above are summed and divided by N, is equal to zero.
The process of summation and division gives us, therefore,

\[ \frac{\Sigma w^{2}}{N} = \frac{\Sigma x_1^{2}}{N} + \frac{\Sigma x_2^{2}}{N} + \frac{\Sigma x_3^{2}}{N} + \cdots + \frac{\Sigma x_n^{2}}{N} \tag{5} \]

or

\[ \sigma_w^{2} = \sigma_1^{2} + \sigma_2^{2} + \sigma_3^{2} + \cdots + \sigma_n^{2}. \tag{6} \]

If all the observations relate to the same universe (i.e., if
the samples are all drawn from the same parent population),
which is true, by hypothesis, the standard deviations appearing
in the right-hand member of equation (6) are equal to
one another and to the standard deviation of the population.
Accordingly, using σ to represent that standard deviation,
we have

\[ \sigma_w^{2} = n\sigma^{2}. \tag{7} \]

The next argument, that leads directly to the desired
measurement, follows precisely these steps, which have been
given in the above form to indicate the reasoning involved.
It starts, however, with a variant form of equation (3).
Dividing that equation throughout by n, we have

\[ \frac{w}{n} = \frac{x_1}{n} + \frac{x_2}{n} + \frac{x_3}{n} + \cdots + \frac{x_n}{n}. \tag{8} \]

Working with the variables $\frac{w}{n}$, $\frac{x_1}{n}$, $\frac{x_2}{n}$, etc., just as we have
done with $w$, $x_1$, $x_2$, etc., we may go through the operations
represented by equations (4), (5), and (6), above. The
product terms disappear, as in passing from (4) to (6). In
the process of squaring the term $\frac{w}{n}$ is treated as an entity;
the sum of the squared values is thus $\Sigma\left(\frac{w}{n}\right)^{2}$. Numerator and
denominator of each of the terms of type $\frac{x_1}{n}$ are squared
separately, however, and the sum is of the form $\frac{\Sigma x_1^{2}}{n^{2}}$. Division
throughout by N then gives the quantities appearing in equation (9),
which corresponds to equation (6).

\[ \sigma_{w/n}^{2} = \frac{\sigma_1^{2}}{n^{2}} + \frac{\sigma_2^{2}}{n^{2}} + \frac{\sigma_3^{2}}{n^{2}} + \cdots + \frac{\sigma_n^{2}}{n^{2}}. \tag{9} \]

Since all observations relate to the same universe, this
reduces to

\[ \sigma_{w/n}^{2} = \frac{n\sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n}. \tag{10} \]

From this

\[ \sigma_{w/n} = \frac{\sigma}{\sqrt{n}}. \tag{11} \]

But w is the sum of n observations drawn from a universe
having a standard deviation of σ, and $\frac{w}{n}$ is the mean of these
observations. $\sigma_{w/n}$ is the standard deviation of a distribution
of arithmetic means, corresponding to the familiar symbol
$\sigma_M$. This is the desired expression for the standard error
of the arithmetic mean, appropriate for use when the σ
of the population is known. Where σ is estimated from the
standard deviation of a sample, accuracy is increased by
using $\sqrt{n - 1}$ rather than $\sqrt{n}$ in the denominator of the
right-hand member of (11).
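
Equations (7) and (10) may be verified exactly on a small universe by listing every possible sample. The Python sketch below is an added illustration; the universe (the six faces of a die) and the sample size n = 3 are hypothetical choices, and every ordered sample is counted once, which corresponds to independent observations with replacement.

```python
from fractions import Fraction as F
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical universe: the faces of a die
n = 3                             # observations per sample

def variance(values):
    """Mean squared deviation about the mean (divisor N, as in the text)."""
    values = list(values)
    m = sum(values, F(0)) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

sigma2 = variance(F(v) for v in population)

# Every possible ordered sample of n independent observations.
sums = [sum(sample) for sample in product(population, repeat=n)]
var_of_sums = variance(F(w) for w in sums)       # equation (7): n * sigma^2
var_of_means = variance(F(w, n) for w in sums)   # equation (10): sigma^2 / n

print(sigma2, var_of_sums, var_of_means)
```

Because all samples are enumerated, the product terms vanish exactly and the two relations hold without sampling error.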
ILLUSTRATING THE MEASUREMENT OF TREND
BY A MODIFIED EXPONENTIAL CURVE, A GOMPERTZ
CURVE AND A LOGISTIC CURVE

The discussion in Chapter VII of mathematical functions
suitable for use in measuring the secular trends of time series
dealt with types required in ordinary practice. We here
discuss briefly three other types suited to the measurement
of long-term movements in economic and business series.

The Modified Exponential Curve

An exponential curve, which plots as a straight line on
ratio paper, is a suitable measure of trend for a series that
is increasing or decreasing at a constant rate, that is, one
that shows constancy of relative growth. The figures defining
the successive trend values of a series of this type constitute
a geometric progression. The trends of certain economic
series that depart from constancy of relative growth
may be accurately defined by a simple modification of the
exponential curve. This is the case when the observed
values may be transformed, by the addition (or subtraction)
of a constant magnitude, to a series closely approximating
such a geometric progression.

If we represent by K the constant magnitude that is to
be added (algebraically) to each observed value in effecting
the desired transformation, the task of fitting the trend line
involves the following steps:

Determination of K.

Correction of observed values by K, to obtain the modified series.

Fitting an exponential curve to the modified series, and computation
of trend values of the modified series.

Correction of trend values of the modified series by K to obtain
trend values of original series.

If y represents the ordinates of trend of the original series
and x represents time, the equation to the desired line of
trend may be put in the form

\[ y = ab^{x} - K \]

where K is the correction factor noted above and a and b
are constants to be determined by fitting an exponential
curve to the modified series. The procedure may be illustrated
with reference to the data in Table D.

Table D

Illustrating the Fitting of a Modified Exponential Curve
Production of Rayon Filament Yarn in the United States,
1920-1931 ¹

(Data in thousands of pounds)

  (1)     (2)          (3)                 (4)          (5)              (6)
  Year    Original     Group               Modified     Trend values,    Trend values,
          series       mean                series       modified         original
          (observed)                       (2) + K      series           series
                                                                         (5) - K

  1920     10,125                           27,669       29,108           11,564
  1921     14,986      M1 =  21,034.25      32,530       34,363           16,819
  1922     24,067                           41,611       40,565           23,021
  1923     34,959                           52,503       47,888           30,344

  1924     36,328                           53,872       56,532           38,988
  1925     51,049      M2 =  56,406.25      68,593       66,736           49,192
  1926     62,693                           80,237       78,782           61,238
  1927     75,555                           93,099       93,003           75,459

  1928     97,232                          114,776      109,790           92,246
  1929    121,399      M3 = 124,210.75     138,943      129,608          112,064
  1930    127,333                          144,877      153,003          135,459
  1931    150,879                          168,423      180,621          163,077

¹ Data from Textile Economics Bureau.

In employing this method we approximate K empirically
by breaking the observed series into three parts, representing
equal periods of time, and determining the mean of the
observations for each period. We may designate these means,
in chronological order, by $M_1$, $M_2$, and $M_3$. The desired
value, K, is given by

\[ K = \left[ M_2^{2} - (M_1 \times M_3) \right] \div \left[ (M_1 + M_3) - 2M_2 \right]. \]

If the observed series constitute a geometric progression
the value of K will be zero; if the addition of a constant
magnitude to the members of the original series will yield
a series approximating a geometric progression, K will be
positive; if the subtraction of a constant amount from the
observed values will yield a series approximating a geometric
progression, K will be negative. (In practice, K is given
the sign obtained by the employment of the method described
above, and then added algebraically to the observed
series.)

In the present case we have

\[ K = \left[ (56{,}406.25)^{2} - (21{,}034.25 \times 124{,}210.75) \right] \div \left[ (21{,}034.25 + 124{,}210.75) - (2 \times 56{,}406.25) \right] = +17{,}544. \]

Adding this amount to each of the values recorded in
col. (2) of Table D, we obtain the modified series in col. (4).
In fitting an exponential curve to the modified series, it is
desirable to use logarithms, that is, to solve the constants
in an equation of the type log y = log a + (log b)x. This
procedure was explained in Chapter VII. For log a of this
curve we obtain 4.824359, and for log b .072068. (The origin
is at 1925.) The antilogarithms of the series of trend values
thus obtained are given in col. (5). These define the trend of
the modified series. Subtracting K (algebraically) from these
values we obtain the trend values of the original series, which
appear in col. (6).

The original series measuring production of rayon filament
yarn and the modified exponential curve fitted to this
series are shown graphically in Fig. A.
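
The whole procedure may be sketched in Python as follows. This is an added illustration, not the author's working method: the observed values are those of col. (2) of Table D as they follow from the printed group means and modified series, K is rounded to the nearest whole number as in the text, and the exponential curve is assumed to be fitted by ordinary least squares to the common logarithms with the origin at 1925.

```python
from math import log10

years = list(range(1920, 1932))
observed = [10125, 14986, 24067, 34959, 36328, 51049,
            62693, 75555, 97232, 121399, 127333, 150879]

def group_means(series):
    """Means of three equal consecutive parts of the series."""
    k = len(series) // 3
    return [sum(series[i * k:(i + 1) * k]) / k for i in range(3)]

# Step 1: determination of K.
m1, m2, m3 = group_means(observed)
K = round((m2 ** 2 - m1 * m3) / ((m1 + m3) - 2 * m2))

# Step 2: the modified series.
modified = [y + K for y in observed]

# Step 3: least-squares line fitted to the logarithms, origin at 1925.
x = [yr - 1925 for yr in years]
logs = [log10(v) for v in modified]
n = len(x)
sx, sy = sum(x), sum(logs)
sxx = sum(v * v for v in x)
sxy = sum(u * v for u, v in zip(x, logs))
log_b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
log_a = (sy - log_b * sx) / n
trend_modified = [10 ** (log_a + log_b * v) for v in x]

# Step 4: trend values of the original series.
trend_original = [t - K for t in trend_modified]

print(K, round(log_a, 6), round(log_b, 6))
print([round(t) for t in trend_original])
```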

It is essential that the three M’s used in the deteiminar 
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Year 

Eig. a. — Total Production of Rayon Klament Yam in the United States, 
1920-1931, with Modified Exponential Trend 

tlon of K relate to equal numbers of observations and that 
the midpomts, in time, of the three periods be equidistant. 
In the above example the number of years included in the 
period is a multiple of three, and no difficulty arises. If the 
number of years included is not a multiple of three, intervals 
that overlap slightly may be employed. For example, if our 
series had run from 1920 to 1932, the three averages might 
have been derived, respectively, from the five-year periods
1920-1924, 1924-1928, 1928-1932. These would center, re-
spectively, at 1922, 1926, and 1930, and would thus be equi-
distant in time from one another. Alternatively, if monthly
data are available, division of the total period into three
equal parts may be facilitated by using a time-unit of 4 or
8 months, rather than 12 months.
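
The choice of overlapping periods may be sketched as follows. The rule used here (take the shortest period length that covers the series and allows the middle period to start exactly halfway between the other two) is an assumption made for illustration; the text states only the requirements of equal length and equidistant midpoints. The function name `three_equal_periods` is likewise hypothetical.

```python
def three_equal_periods(first_year: int, last_year: int) -> list[tuple[int, int]]:
    """Return three equal-length periods whose midpoints are equidistant.

    The periods overlap only when the number of years is not a multiple of three.
    """
    total = last_year - first_year + 1
    if total < 3:
        raise ValueError("at least three observations are required")
    length = -(-total // 3)  # ceiling of total / 3
    # The middle period must start halfway between the first and last starts,
    # so (total - length) must be even; lengthen the period by one year if not.
    if (total - length) % 2:
        length += 1
    step = (total - length) // 2
    return [(first_year + i * step, first_year + i * step + length - 1)
            for i in range(3)]


print(three_equal_periods(1920, 1931))  # twelve years, no overlap needed
print(three_equal_periods(1920, 1932))  # thirteen years, periods share end years
```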

The Gompertz Curve

The Gompertz curve, which has important uses in actu-
arial science, has had some application in the study of eco-
nomic and business trends. The term "growth curve" is
applicable to it, since it portrays a process of cumulative
expansion to a maximum value. This expansion proceeds
by decreasing absolute increments in the later stages, but 
continues to the end without retrogression. It may not be 
assumed that this form of growth is typical of all industrial 
development, but the curve has value as an empirical rep- 
resentation of certain trend movements. 

For the purpose of fitting, the equation to the curve is
transformed from the natural form

$$y = a\,b^{c^{x}}$$

to the logarithmic form

$$\log y = \log a + (\log b)\,c^{x}.$$

When fitted to an appropriate set of observations, measur-
ing the expansion of an industry or the growth of an eco-
nomic element, log a is the logarithm of the maximum value,
the ceiling that the curve approaches. The second term
measures the amount by which the trend value at a given
time falls short of this maximum, an amount that diminishes,
of course, with the passage of time. (The series for which
this curve is an appropriate measure of trend will be ex-
panding by decreasing amounts in the later stages of the
period covered, and c, derived in the manner indicated below,
will have a value between zero and unity.) The origin on
the x-scale (time) is taken at the year to which the first
entry relates.

The method employed in fitting this curve is an approxi- 
mative one, since the least squares procedure in customary 
form is not applicable. Here, as in the preceding example, 
the series is broken into three equal portions. The sum of 
the logarithms of the observations in each of these segments 
is obtained; from these sums, and the differences between 
them, the necessary constants may be computed. The 
method is illustrated with reference to the data of rayon pro- 
duction for the years 1920-1937, which appear in Table E. 


Table E

Computation of Quantities Required in the Fitting of a
Gompertz Curve

Production of Rayon Filament Yarn in the United States, 1920-1937
(Data in thousands of pounds)

Year    Rayon production    Log y          Sub-totals
               y

1920         10,125         4.0053950
1921         14,986         4.1756857
1922         24,067         4.3814220
1923         34,959         4.5435590
1924         36,328         4.5602415
1925         51,049         4.7079872      $S_1$ = 26.374290
1926         62,693         4.7972191
1927         75,555         4.8782632
1928         97,232         4.9878092
1929        121,399         5.0842151
1930        127,333         5.1049409
1931        150,879         5.1786288      $S_2$ = 30.031076
1932        134,670         5.1292709
1933        213,498         5.3293938
1934        208,321         5.3187331
1935        257,557         5.4108734
1936        277,626         5.4434601
1937        312,236         5.4944829      $S_3$ = 32.126214

Total                      88.5315830

First differences:
$d_1 = S_2 - S_1 = 3.656786$
$d_2 = S_3 - S_2 = 2.095138$
We may use n to define the number of terms entering
into each of the three sub-totals (in the present example
n = 6); the sub-totals are represented, in chronological or-
der, by $S_1$, $S_2$, and $S_3$; the first differences between the sub-
totals are represented by $d_1$ and $d_2$.¹ We use these quantities
in solving for the three constants c, log b and log a. The
general relations from which these values are determined
are the following:

$$c^{n} = \frac{d_2}{d_1}$$

$$\log b = \frac{d_1 (c - 1)}{(c^{n} - 1)^{2}}$$

$$\log a = \frac{1}{n}\left[S_1 - \frac{d_1}{c^{n} - 1}\right]$$

Inserting the proper quantities, we have

$$c^{6} = \frac{2.095138}{3.656786} = .572945$$

$$c = \sqrt[6]{.572945} = .911351$$

$$\log b = \frac{3.656786 \times (-.088649)}{(.572945 - 1)^{2}} = -1.777493$$

$$\log a = \frac{1}{6}\left[26.374290 - \frac{3.656786}{.572945 - 1}\right] = 5.822848.$$
The required equation is, therefore, 

$$\log y = 5.822848 - 1.777493\,(.911351^{x})$$

in which x relates to deviations from an origin at the position 
of the first term. 
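
The fitting procedure may be sketched in a few lines. The function name `fit_gompertz` is an illustrative assumption. The production figures are those of Table E; the entries for 1927, 1931 and 1935 are taken here as 75,555, 150,879 and 257,557, the values that agree with the tabulated logarithms.

```python
import math

# Rayon filament yarn production, 1920-1937, in thousands of pounds (Table E).
# Assumption: 1927, 1931 and 1935 are read so as to agree with the printed logs.
production = [10125, 14986, 24067, 34959, 36328, 51049,
              62693, 75555, 97232, 121399, 127333, 150879,
              134670, 213498, 208321, 257557, 277626, 312236]


def fit_gompertz(y):
    """Fit log10 y = log a + (log b) * c**x from three equal sub-totals of logs."""
    if len(y) % 3:
        raise ValueError("the number of observations must be a multiple of three")
    n = len(y) // 3
    logs = [math.log10(v) for v in y]
    s1, s2, s3 = (sum(logs[i * n:(i + 1) * n]) for i in range(3))
    d1, d2 = s2 - s1, s3 - s2
    c_to_n = d2 / d1
    if not 0 < c_to_n < 1:
        raise ValueError("series is not expanding by decreasing increments")
    c = c_to_n ** (1.0 / n)
    log_b = d1 * (c - 1) / (c_to_n - 1) ** 2
    log_a = (s1 - d1 / (c_to_n - 1)) / n
    return c, log_b, log_a


c, log_b, log_a = fit_gompertz(production)
print(f"c = {c:.6f}, log b = {log_b:.6f}, log a = {log_a:.6f}")
```

The guard on `c_to_n` corresponds to the condition that the second difference be smaller than the first; a series failing it is not suited to this curve.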

Substituting in this trend equation the values of x given
in Table F, logarithms of the trend values are obtained. The
corresponding natural numbers define the course of the line
of trend. The method of calculation is indicated in Table F.


¹ The condition, previously noted, that the series to which the curve is to
be fitted be one that is expanding by decreasing increments in the later
stages of the period covered, is met when $d_2$ is less than $d_1$.
Table F

Illustrating the Computation of Ordinates of Trend of a Gompertz
Curve Fitted to Data of Rayon Production, 1920-1937

(1)     (2)    (3)         (4)              (5)              (6)
Year     x     $c^x$       $(\log b)c^x$    log y            y
                                            = (4) + log a    Anti-log of (5)
                                                             (in thousands
                                                             of pounds)

1920     0     1.000000    -1.777493        4.045355          11,101
1921     1     0.911351    -1.619920        4.202928          15,956
1922     2     0.830560    -1.476315        4.346533          22,209
1923     3     0.756932    -1.345441        4.477407          30,020
1924     4     0.689830    -1.226168        4.596680          39,508
1925     5     0.628677    -1.117469        4.705379          50,743
1926     6     0.572945    -1.018408        4.804440          63,744
1927     7     0.522154    -0.928125        4.894723          78,474
1928     8     0.475865    -0.845847        4.977001          94,842
1929     9     0.433681    -0.770865        5.051983         112,715
1930    10     0.395235    -0.702527        5.120321         131,923
1931    11     0.360198    -0.640249        5.182599         152,265
1932    12     0.328267    -0.583492        5.239356         173,523
1933    13     0.299166    -0.531765        5.291083         195,471
1934    14     0.272645    -0.484625        5.338223         217,883
1935    15     0.248475    -0.441663        5.381185         240,539
1936    16     0.226448    -0.402510        5.420338         263,231
1937    17     0.206374    -0.366830        5.456018         285,771
1947    27     0.081566    -0.144983        5.677865         476,283
1957    37     0.032238    -0.057303        5.765545         582,834
1967    47     0.012741    -0.022647        5.800201         631,245


The original data and the Gompertz curve fitted to them
are shown graphically in Fig. B.
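
The ordinates of trend may be reproduced from the fitted constants by a sketch of the following kind. The constants are those of the trend equation given above; the function name `gompertz_trend` and the constant names are illustrative assumptions.

```python
# Constants of the fitted equation log y = log a + (log b) * c**x, from the text.
LOG_A = 5.822848
LOG_B = -1.777493
C = 0.911351
ORIGIN_YEAR = 1920  # x = 0 at the first observation


def gompertz_trend(year: int) -> float:
    """Trend value, in thousands of pounds, for a given year."""
    x = year - ORIGIN_YEAR
    log_y = LOG_A + LOG_B * C ** x
    return 10.0 ** log_y


for year in (1920, 1937, 1947, 1957, 1967):
    print(year, f"{gompertz_trend(year):,.0f}")

# As x grows, c**x tends to zero and the trend approaches the antilog of log a.
ceiling = 10.0 ** LOG_A
print(f"ceiling: {ceiling:,.0f} thousand pounds")
```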

The ceiling to this curve is set by the constant a, which
has a value of approximately 665,000,000 pounds. This
indicates that if the extrapolation of the trend of rayon
production from 1920 to 1937, as measured by a Gompertz
curve, accurately defines the future course of production,
the maximum output to be expected is 665 million pounds
per year. (It need hardly be pointed out that this extra-
polation involves some doubtful assumptions, and that no
mystic significance is to be attached to it.) The years to
which the present data relate were years of rapid expansion
in the industry. The slackening of the rate of increase,
which is to be expected in a mature industry, had not be-
come marked by 1937. In order that the nature of the curve
may be clear, extrapolated values for 1947, 1957, and 1967
are given in the table, and the projection of the trend is
shown in Fig. B. After 1947, and still more conspicuously
after 1957, the curve shows a notable dampening in the rate
of expansion. We may not say that the industry will actually
follow this course. In particular, the asymptote a may be
expected to change, as conditions affecting the industry
and the demand for its products vary in the future. Within
the limits of the observations, however, the Gompertz curve
serves as a satisfactory measure of trend.

Fig. B. Total Production of Rayon Filament Yarn in the United States,
1920-1937, with Gompertz Trend Line Extrapolated to 1967

The Logistic Curve 

The logistic curve, sometimes termed the Pearl-Reed
growth curve because of the extensive use made of it in
population studies by Raymond Pearl and L. J. Reed,
resembles somewhat the Gompertz curve discussed above.
It represents a modified geometric progression, the growth
of a series whose rate of increase tends to decline as the
series approaches some specified limit. Like the Gompertz
curve it may be used as an empirical approximation to the
trends of certain economic series. Extrapolations are subject,
of course, to the same uncertainties that attach to
projections of other empirically derived trend lines.

Table G

Computation of Quantities Required in the Fitting of a Logistic Curve
to Data of Railroad Mileage Operated in the United States,
by Five-Year Intervals, 1850-1935

(1)     (2)          (3)              (4)               (5)
Year    Miles of     100,000,000/y    Sub-totals        First differences
        railroad
        operated
           y

1850      9,021        11,085
1855     18,374         5,442
1860     30,626         3,265
1865     35,085         2,850
1870     52,922         1,890
1875     74,096         1,350         $S_1$ = 25,882
1880     93,262         1,072
1885    128,320           779
1890    156,404           639
1895    177,746           563
1900    192,556           519
1905    216,974           461         $S_2$ = 4,033     $d_1 = S_2 - S_1$ = -21,849
1910    240,831           415
1915    257,569           388
1920    259,941           385
1925    258,631           387
1930    260,440           384
1935    252,930           395         $S_3$ = 2,354     $d_2 = S_3 - S_2$ = -1,679

A form of this curve adapted to use as a measure of trend
is defined by the equation

$$\frac{1}{y} = a + b\,c^{x}.$$

This, it will be noted, is the equation to a modified exponen-
tial curve, except that the dependent variable is $\frac{1}{y}$ rather
than y. (The symbols here used for the constants differ
somewhat from those employed in treating the modified
exponential curve.) A method of fitting somewhat similar
to those employed in the preceding examples may be em-
ployed, with necessary modifications required by the use of
reciprocals of y. The method may be discussed with refer-
ence to the data of railroad mileage in Table G. Compu-
tations are facilitated by multiplying the reciprocals of y by
a suitable power of 10, as is done in col. (3) of this table.

As in the two preceding illustrations, the observations are
divided, chronologically, into three equal groups. Group
sub-totals and the first differences between these sub-totals
are computed. The symbol n is used for the number of terms
in each of these sub-groups. The origin of the x-scale (time)
is set at the date of the first observation. The time unit
here employed is five years.

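The constants in the desired equation may be derived
from the following relations.

$$c^{n} = \frac{d_2}{d_1}$$

$$b = \frac{d_1 (c - 1)}{(c^{n} - 1)^{2}}$$

$$a = \frac{1}{n}\left\{S_1 - \frac{d_1}{c^{n} - 1}\right\}.$$

Substituting the given values, we have

$$c^{6} = \frac{-1{,}679}{-21{,}849} = .076846$$

$$c = \sqrt[6]{.076846} = .652034$$

$$b = \frac{-21{,}849 \times (-.347966)}{(.076846 - 1)^{2}} = +8{,}921.14$$

$$a = \frac{1}{6}\left\{25{,}882 - \frac{-21{,}849}{.076846 - 1}\right\} = +369.04.$$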

These results relate to initial observations which have been
modified by the multiplication of $\frac{1}{y}$ by 100,000,000. The
desired equation is, therefore,

$$\frac{100{,}000{,}000}{y} = 369.04 + 8{,}921.14\,(.652034^{x})$$

where x measures deviations in five-year units from an origin
at 1850.
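
The derivation of the three constants may be sketched as follows. The function name `fit_logistic` is an illustrative assumption, and the scaled reciprocals are rounded to whole numbers on the assumption that col. (3) of Table G was so rounded before the sub-totals were formed.

```python
# Miles of railroad operated, by five-year intervals, 1850-1935 (Table G).
miles = [9021, 18374, 30626, 35085, 52922, 74096,
         93262, 128320, 156404, 177746, 192556, 216974,
         240831, 257569, 259941, 258631, 260440, 252930]
SCALE = 100_000_000


def fit_logistic(y, scale=SCALE):
    """Fit scale / y = a + b * c**x from three equal sub-totals of reciprocals."""
    if len(y) % 3:
        raise ValueError("the number of observations must be a multiple of three")
    n = len(y) // 3
    # Assumption: reciprocals are rounded to integers, as col. (3) appears to be.
    recip = [round(scale / v) for v in y]
    s1, s2, s3 = (sum(recip[i * n:(i + 1) * n]) for i in range(3))
    d1, d2 = s2 - s1, s3 - s2
    c_to_n = d2 / d1
    c = c_to_n ** (1.0 / n)
    b = d1 * (c - 1) / (c_to_n - 1) ** 2
    a = (s1 - d1 / (c_to_n - 1)) / n
    return c, b, a


c, b, a = fit_logistic(miles)
print(f"c = {c:.6f}, b = {b:,.2f}, a = {a:.2f}")
```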

Succeeding calculations are shown in Table H. 


Table H

Computation of Ordinates of Trend of Logistic Curve Fitted to Data
of Railroad Mileage

(1)     (2)    (3)         (4)        (5)                (6)
Year     x     $c^x$       $bc^x$     100,000,000/y      y
                                      $= (a + bc^x)$     = 100,000,000 ÷ (5)

1850     0     1.000000    8,921      9,290               10,764
1855     1      .652034    5,817      6,186               16,166
1860     2      .425148    3,793      4,162               24,027
1865     3      .277211    2,473      2,842               35,186
1870     4      .180751    1,613      1,982               50,454
1875     5      .117856    1,051      1,420               70,423
1880     6      .076846      686      1,055               94,787
1885     7      .050106      447        816              122,549
1890     8      .032671      291        660              151,515
1895     9      .021303      190        559              178,891
1900    10      .013890      124        493              202,840
1905    11      .009057       81        450              222,222
1910    12      .005905       53        422              236,967
1915    13      .003850       34        403              248,139
1920    14      .002511       22        391              255,754
1925    15      .001637       15        384              260,417
1930    16      .001067       10        379              263,852
1935    17      .000696        6        375              266,667
The process of calculation is a straightforward one. The
reciprocals of the entries in col. (5), multiplied by
100,000,000, yield the desired trend values given in col. (6).
These values, with the original series, are shown graphically in
Fig. C.

Fig. C. Railroad Mileage Operated in the United States, by Five-Year
Intervals, 1850-1935, with Logistic Trend

As in the case of the Gompertz curve, the logistic is suit-
able for measuring the trend of a series that, in its later
stages, is growing by decreasing increments. The curve
resembles an elongated S rising from a lower asymptote of
zero to an upper limit indicated by the constant a. Since a in
this case refers to an equation in which the dependent vari-
able is $\frac{100{,}000{,}000}{y}$, the actual asymptote is $\frac{100{,}000{,}000}{a}$.
From the given value of a, 369.04, we derive 270,973 miles
as the upper limit of railroad mileage in the United States.
As is clear from the table and chart, the actual values are
close to this indicated limit. Barring the possibility of a
fundamental change in relevant conditions, the record and
the curve fitted to it indicate that the era of railroad ex-
pansion has ended. The extrapolation is, of course, subject
to all the reservations that attach to the projection of other
curves. There can be no doubt that, within the limits of
the observations, the logistic curve gives an excellent rep-
resentation of the actual history of railroad operation in the
United States.
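
The trend values and the upper limit may be reproduced from the fitted constants by a sketch such as the following. The constants are those given in the text; the function name `logistic_trend` is an illustrative assumption. Because the sketch carries unrounded values through the division, its results differ slightly in the later years from col. (6) of Table H, which is computed from the rounded entries of col. (5).

```python
# Constants of 100,000,000 / y = a + b * c**x, as given in the text.
A = 369.04
B = 8921.14
C = 0.652034
SCALE = 100_000_000
ORIGIN_YEAR = 1850
YEARS_PER_UNIT = 5  # x is measured in five-year units


def logistic_trend(year: int) -> float:
    """Trend value of railroad mileage for a given year."""
    x = (year - ORIGIN_YEAR) / YEARS_PER_UNIT
    return SCALE / (A + B * C ** x)


for year in (1850, 1880, 1910, 1935):
    print(year, f"{logistic_trend(year):,.0f}")

# As x grows, c**x tends to zero and y approaches SCALE / a.
upper_limit = SCALE / A
print(f"upper limit: {upper_limit:,.0f} miles")
```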
A FURTHER APPLICATION OF VARIANCE
ANALYSIS

The possibilities of Fisher's method of variance analysis
were far from exhausted by the several examples given in
Chapter XV. We here supplement the treatment in that
chapter by an additional example, illustrating further tests
that may be made with a two-fold principle of classification.

The observations on which this example is based consist
of relative numbers, measuring the prices of 670 commodities
in February, 1933, with average prices in 1926 taken as 100.
February, 1933 marked the low point of the severe price
decline that began in 1929. The questions to which our
tests are directed relate to the relative severity of the declines
occurring among different classes of goods.

The 670 price relatives (obtained from price quotations
compiled by the U. S. Bureau of Labor Statistics) may be
classified into those relating to perishable goods (505 in
number) and those relating to durable goods (165 in number).
The classification has economic significance because of differ-
ences in the market conditions, on both supply and demand
sides, affecting these classes of goods during a major reces-
sion. Again, the 670 observations may be broken down into
those relating to raw materials (134 in number) and those
relating to manufactured goods (536 in number). Applying
the two principles of classification jointly we obtain 4 sub-
groups, perishable raw materials (101 in number), perish-
able manufactured goods (404 in number), durable raw
materials (33 in number) and durable manufactured goods
(132 in number). It is to be noted that the ratio of the num-
ber of perishable raw materials to the number of perishable
manufactured goods, 101:404, is the same as the ratio
of the number of durable raw materials to the number of
durable manufactured goods, 33:132. It is a necessary con-
dition of the procedure here discussed that the frequencies
in the several sub-groups be proportional.
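
The condition of proportionality may be verified mechanically. In the sketch below, frequencies are treated as proportional when every cell count equals the product of its row and column totals divided by the grand total, a criterion equivalent to the equality of ratios noted above. The function name `frequencies_proportional` and the dictionary layout are illustrative assumptions.

```python
from fractions import Fraction

# Numbers of commodities in the four sub-groups, as given in the text.
cell_counts = {
    ("perishable", "raw"): 101,
    ("perishable", "manufactured"): 404,
    ("durable", "raw"): 33,
    ("durable", "manufactured"): 132,
}


def frequencies_proportional(counts) -> bool:
    """True when each cell count equals row total * column total / grand total."""
    row_totals, col_totals = {}, {}
    for (row, col), k in counts.items():
        row_totals[row] = row_totals.get(row, 0) + k
        col_totals[col] = col_totals.get(col, 0) + k
    grand_total = sum(counts.values())
    # Exact rational arithmetic avoids any rounding in the comparison.
    return all(
        Fraction(row_totals[row] * col_totals[col], grand_total) == k
        for (row, col), k in counts.items()
    )


print(frequencies_proportional(cell_counts))
```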

Various questions relating to the significance of these prin- 
ciples of classification may be answered with reference to the 
summary figures given in Table I. 

Table I

Measurements Relating to the Analysis of the Relative Prices of
670 Commodities for February, 1933
(1926 = 100)

1                             2                              I
Perishable                    Perishable                     All
raw materials                 manufactured goods             perishable goods
$N_1$ = 101                   $N_2$ = 404                    $N_p$ = 505
$M_1$ = 41.663366             $M_2$ = 62.329208              $M_p$ = 58.196040
$\Sigma d^2$ = 31,118.56      $\Sigma d^2$ = 187,414.21      $\Sigma d^2$ = 253,040.57

3                             4                              II
Durable                       Durable                        All
raw materials                 manufactured goods             durable goods
$N_3$ = 33                    $N_4$ = 132                    $N_d$ = 165
$M_3$ = 65.060606             $M_4$ = 75.719697              $M_d$ = 73.587879
$\Sigma d^2$ = 12,217.88      $\Sigma d^2$ = 31,308.63       $\Sigma d^2$ = 46,525.97

A                             B
All                           All                            All
raw materials                 manufactured goods             commodities
$N_r$ = 134                   $N_m$ = 536                    $N$ = 670
$M_r$ = 47.425373             $M_m$ = 65.626866              $M$ = 61.986567
$\Sigma d^2$ = 56,952.76      $\Sigma d^2$ = 236,562.35      $\Sigma d^2$ = 329,029.89


The entries relating to each group and sub-group define 
the number of commodities included, the mean value of the 
price relatives for February, 1933, and the sum of the 
squares of the deviations of the observations in that group 
from the mean of that group. Thus for perishable raw ma-
terials the mean is 41.663366 (indicating an average price
decline of 58.34 per cent) and the sum of the squares of the 
deviations of the 101 observations in this group from 
41.663366 is 31,118.56. For all commodities the mean is 
61.986567, and the sum of the squares of the deviations of 
the individual items from this mean is 329,029.89. 
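
The entries for the combined groups in Table I follow from those for the four sub-groups, as the sketch below indicates: the sum of squares of a merged group is the sum of the component sums of squares plus the weighted squared deviations of the component means from the merged mean. The function name `pool` and the tuple layout are illustrative assumptions; the mean for durable raw materials is taken as 65.060606, the value consistent with the class means and the general mean.

```python
def pool(groups):
    """Merge (count, mean, sum of squared deviations) triples into one triple."""
    n = sum(count for count, _, _ in groups)
    mean = sum(count * m for count, m, _ in groups) / n
    # Within-component variability plus variability of component means.
    ss = sum(s + count * (m - mean) ** 2 for count, m, s in groups)
    return n, mean, ss


perishable_raw = (101, 41.663366, 31_118.56)
perishable_manufactured = (404, 62.329208, 187_414.21)
durable_raw = (33, 65.060606, 12_217.88)  # assumed reading of the mean
durable_manufactured = (132, 75.719697, 31_308.63)

all_perishable = pool([perishable_raw, perishable_manufactured])
all_raw = pool([perishable_raw, durable_raw])
all_commodities = pool([perishable_raw, perishable_manufactured,
                        durable_raw, durable_manufactured])

for label, (n, mean, ss) in [("perishable", all_perishable),
                             ("raw", all_raw),
                             ("all", all_commodities)]:
    print(f"{label:>10}: N = {n}, M = {mean:.6f}, sum of squares = {ss:,.2f}")
```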

A TEST OF THE PERISHABLE-DURABLE PRINCIPLE OF
CLASSIFICATION

We may first ask whether the application of the two 
basic principles of classification, considered separately, gives 
groups showing significant differences in their price changes 
from 1926 to February, 1933. Examining the results of the 
perishable-durable distinction, we note that durable goods, 
with an average of 73.587879, show smaller price declines 
than perishable goods, for which the average is 58.196040.
(Six decimal places are retained in the averages because these
figures enter into later calculations.) Is the difference sig-
nificant, or may it be attributed to chance? A test of the
type discussed in Chapter XV provides an answer to this
question. For the application of the test we must divide the
total variability, 329,029.89, into a portion unaffected by
perishable-durable differences and a portion that may be
attributed to the play of forces directly related to this dis-
tinction.

The first of these portions, measuring the variability 
within classes, is derived directly from the figures in Table I. 

Variability within perishable group
    = $\Sigma d^2$ for that group                = 253,040.57
Variability within durable group
    = $\Sigma d^2$ for that group                =  46,525.97
Total variability within classes                   299,566.54

In deriving a measure of the variability between classes
we take the deviation of each class mean from the mean of
all the observations, square this, and weight by the number
of observations in that class. Thus

$$\Sigma d^2 \text{ between perishable-durable classes} = [(61.986567 - 58.196040)^{2} \times 505] + [(61.986567 - 73.587879)^{2} \times 165] = 29{,}463.31.$$

A test of the significance of this classification reduces to
the question whether the variability between classes is sig-
nificantly greater than the variability within classes, when
account has been taken of the number of degrees of freedom
present in the two measures of variability. The appropriate
z-test is shown below.

Nature of            Degrees of    Sum of         Variance      $\log_e \sigma^2$
variability          freedom       squares        $\sigma^2$
                         n

Between classes          1          29,463.31     29,463.31     10.290900
Within classes         668         299,566.54        448.45      6.105804
                       669         329,029.85
                                                    Diff. =      4.185096
                                                        z =      2.09

For $n_1$ = 1 and $n_2$ = 668 the 1 per cent value of z is approx-
imately .95; the present value is materially greater than
this. The variance between classes is significantly greater
than the variance within classes. The results are not con-
sistent with the hypothesis that the true value of z is zero.
There is a significant difference between the February, 1933,
price relatives of perishable and durable goods, on the 1926
base. This principle of classification is a significant one,
with reference to this aspect of price behavior.
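
The z-test of this section may be sketched as follows. Fisher's z is half the difference between the natural logarithms of the two variances, which is the same thing as half the natural logarithm of their ratio. The function name `fisher_z` is an illustrative assumption; the critical value .95 is the approximate 1 per cent point quoted in the text.

```python
import math


def fisher_z(ss_between: float, df_between: int,
             ss_within: float, df_within: int) -> float:
    """Half the difference of the natural logs of the two variances."""
    variance_between = ss_between / df_between
    variance_within = ss_within / df_within
    return 0.5 * (math.log(variance_between) - math.log(variance_within))


# Perishable-durable classification: sums of squares and degrees of freedom
# as given in the text.
z = fisher_z(29_463.31, 1, 299_566.54, 668)
ONE_PER_CENT_Z = 0.95  # approximate value quoted for n1 = 1, n2 = 668

print(f"z = {z:.2f}; significant at the 1 per cent level: {z > ONE_PER_CENT_Z}")
```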

A TEST OF THE RAW-MANUFACTURED PRINCIPLE OF
CLASSIFICATION

The test of the other main principle of classification follows
exactly the same lines. The total variability, 329,029.89,
is broken into a portion measuring variability within classes
(293,515.11), with 668 degrees of freedom, and a portion
measuring variability between the raw-manufactured classes
(35,514.75) with 1 degree of freedom. The value of z is
2.20; the corresponding 1 per cent value of z is .95. This
principle of classification, also, is significant. Raw and
manufactured goods differed significantly in degree of price
change between 1926 and February, 1933.

A TEST OF THE RESULTS OBTAINED FROM THE JOINT APPLICA-
TION OF THE PERISHABLE-DURABLE AND RAW-MANU-
FACTURED PRINCIPLES OF CLASSIFICATION

The application of the two principles of classification dis-
cussed above yields the 4 cells shown in Table I. We may
ask whether the four groups thus distinguished (perishable
raw materials, perishable manufactured goods, durable raw
materials, and durable manufactured goods) are signifi-
cantly different, judged with reference to the present obser-
vations. The two essential elements of the total variability
are derived from the figures in Table I in the manner indi-
cated below.

Variability within perishable raw materials group  =  31,118.56
Variability within perishable manufactures group   = 187,414.21
Variability within durable raw materials group     =  12,217.88
Variability within durable manufactures group      =  31,308.63

Total variability within cells                       262,059.28

This sum furnishes the yardstick that is used in the tests
that follow. It is clear that it represents the action of forces
other than those related to relative durability, or to degree
of fabrication. For its four elements measure variability
among commodities that are alike in respect of durability
and alike in respect of degree of fabrication.¹ This sum is a
measure of the strength of the forces we lump together
as chance, which here means all factors affecting our observa-
tions other than those related to the relative durability of
commodities or to degree of fabrication of commodities.

¹ This statement may be accepted as accurate for the purpose of the present
demonstration. Actually, of course, the distinctions between perishable and
durable commodities and between raw and manufactured goods are not clear-
cut and definitive.
A measure of variability between cells is derived as in the
previous examples.

$$\begin{aligned}
\Sigma d^2 \text{ between cells} &= [(61.986567 - 41.663366)^{2} \times 101] \\
&\quad + [(61.986567 - 62.329208)^{2} \times 404] \\
&\quad + [(61.986567 - 65.060606)^{2} \times 33] \\
&\quad + [(61.986567 - 75.719697)^{2} \times 132] \\
&= 66{,}970.60.
\end{aligned}$$

The test of significance takes the following form.

Nature of          Degrees of    Sum of         Variance      $\log_e \sigma^2$
variability        freedom       squares        $\sigma^2$
                       n

Between cells          3          66,970.60     22,323.53     10.013395
Within cells         666         262,059.28        393.48      5.975035
                     669         329,029.88
                                                  Diff. =      4.038360
                                                      z =      2.02

For $n_1$ = 3, $n_2$ = 666 the 1 per cent value of z is approxi-
mately .67. The present value materially exceeds this.
The conclusion is clear that the joint application of the two
principles of classification yields sub-groups which differed
significantly in their price movements between 1926 and
February, 1933.
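
The test between cells may be sketched directly from the cell statistics. The dictionary layout and variable names are illustrative assumptions; the mean for durable raw materials is taken as 65.060606, the value that reproduces the sum of squares between cells stated in the text.

```python
import math

GRAND_MEAN = 61.986567

# (number of commodities, mean price relative, sum of squared deviations)
cells = {
    "perishable raw": (101, 41.663366, 31_118.56),
    "perishable manufactured": (404, 62.329208, 187_414.21),
    "durable raw": (33, 65.060606, 12_217.88),  # assumed reading of the mean
    "durable manufactured": (132, 75.719697, 31_308.63),
}

# Each cell mean's squared deviation from the general mean, weighted by count.
ss_between = sum(n * (mean - GRAND_MEAN) ** 2 for n, mean, _ in cells.values())
ss_within = sum(ss for _, _, ss in cells.values())

df_between = len(cells) - 1
df_within = sum(n for n, _, _ in cells.values()) - len(cells)

z = 0.5 * math.log((ss_between / df_between) / (ss_within / df_within))
print(f"between cells: {ss_between:,.2f} on {df_between} degrees of freedom")
print(f"within cells:  {ss_within:,.2f} on {df_within} degrees of freedom")
print(f"z = {z:.2f}")
```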

FURTHER TESTS OF THE MAIN PRINCIPLES OF CLASSIFICATION

The test applied in the preceding section does not bring 
out the most significant uses of a two-fold principle of classi- 
fication. In treating the four cells as we have, we have not 
made full use of the information we possess about them. The
variance between cells, measured by the sum 66,970.60, with
3 degrees of freedom, represents the combined influence of 
forces related to the perishable-durable principle of classifi- 
cation, to the raw-manufactured principle, and to the inter- 
action among forces related to these two principles. We 
may apply more refined tests, and obtain more accurate 
information about the differential price behavior of com- 
modities of different types, by distinguishing the components 
of the variance between cells. This is done in Table J, 
which presents a complete breakdown of the total variance
of the observations with which we are working.

Table J

Components of Variance among Observations Relating to Commodity
Price Movements, 1926 to February, 1933
(1926 = 100)

Nature of variability                    Degrees of    Sum of         Variance
                                         freedom       squares        $\sigma^2$

Between perishable-durable classes           1          29,463.31     29,463.31
Between raw-manufactured classes             1          35,514.75     35,514.75
Interaction (residual variability
  between cells)                             1           1,992.54      1,992.54
Within cells ("experimental error")        666         262,059.28        393.48
                                           669         329,029.88
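
The breakdown in Table J may be reproduced by simple subtraction, as the following sketch indicates. The variable names are illustrative assumptions; the sums of squares are those given in the text.

```python
# Sums of squares and degrees of freedom from the preceding sections.
ss_between_cells, df_between_cells = 66_970.60, 3
ss_perishable_durable, df_perishable_durable = 29_463.31, 1
ss_raw_manufactured, df_raw_manufactured = 35_514.75, 1
ss_within_cells, df_within_cells = 262_059.28, 666

# The interaction is what remains of the variability between cells after the
# two main classifications have been accounted for.
ss_interaction = ss_between_cells - ss_perishable_durable - ss_raw_manufactured
df_interaction = df_between_cells - df_perishable_durable - df_raw_manufactured

components = [
    ("Between perishable-durable classes", df_perishable_durable, ss_perishable_durable),
    ("Between raw-manufactured classes", df_raw_manufactured, ss_raw_manufactured),
    ("Interaction", df_interaction, ss_interaction),
    ("Within cells", df_within_cells, ss_within_cells),
]
for name, df, ss in components:
    print(f"{name:<36}{df:>5}{ss:>14,.2f}{ss / df:>14,.2f}")

total_ss = sum(ss for _, _, ss in components)
total_df = sum(df for _, df, _ in components)
print(f"{'Total':<36}{total_df:>5}{total_ss:>14,.2f}")
```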



Having these components we may test with greater ac-
curacy than in the two separate tests above the significance
of the two main principles of classification. For we now have a better
yardstick, a better measure of the magnitude of variations
due to the play of "chance." The variability within cells
(variance = 393.48) is a better criterion of the magni-
tude of sampling errors than is the variability within the
perishable and durable classes (variance = 448.45) or the
variability within the raw and manufactured classes (vari-
ance = 439.39). For the variance within the four cells is
free of the influence of forces connected with either of the
specified principles of classification.

This more accurate test of the perishable-durable prin-
ciple of classification is applied by the customary method.

Nature of                               Degrees of    Variance      $\log_e \sigma^2$
variability                             freedom       $\sigma^2$

Between perishable-durable classes          1         29,463.31     10.290900
Within cells                              666            393.48      5.975035
                                                        Diff. =      4.315865
                                                            z =      2.16
The 1 per cent value of z is approximately .95, and the
above result is clearly significant. The application of the
perishable-durable principle of classification, under the con-
ditions represented in Table I, yields classes of commodities
that differed significantly in their price changes between
1926 and February, 1933. It is important to note that raw
and manufactured goods are present in the perishable and
durable groups in precisely the same proportions. One fifth
of the commodities in each group are raw materials and
four fifths are manufactured goods. Thus behavior peculiar
to raw materials may be expected to influence the two
groups in precisely the same degree; the same is true of
behavior peculiar to goods in the manufactured state.¹ It is
necessary, for this reason, that the frequencies in the several
classes be proportional in the application of the tests here
discussed, when two principles of classification are jointly
employed.²

A test of the significance of the raw-manufactured prin-
ciple of classification may be applied in the same way.
The variance within cells is employed as yardstick, as in the
preceding example. Here, also, proportionality is necessary,
with raw and manufactured goods being divided in the same
proportions into perishable and durable sub-groups. The
test reveals a significant difference in price behavior between
raw and manufactured commodities.

¹ See below, however, for a test of the significance of the interaction.

² For a discussion of procedures appropriate to cases in which cell frequencies
are not proportional see

Yates, F. Journal of Agricultural Science, Vol. 23, 108 (1933).

Snedecor, G. W. and Cox, G. M. Iowa Agricultural Experiment Station
Bulletin 180 (1935).

A TEST OF THE INTERACTION

Not all the variability between cells is explained by the
two major classifications we have just discussed. The
residual variability between cells, or the interaction, amounts
to 1,992.54, in terms of squared deviations (see Table J).
This may be derived readily by subtracting from the total
variability between cells (66,970.60) the sum of the variabil-
ity between perishable-durable classes (29,463.31) and the
variability between raw-manufactured classes (35,514.75).
The number of degrees of freedom in the interaction may be
determined by the same process of subtraction. In the pres-
ent instance it is 1.

This residual variability may represent "experimental
error," the play of the same chance forces that are measured
by the variability within cells. The residual variability
was used, in the last example cited in Chapter XV, as a yard-
stick defining the magnitude of fluctuations due to chance.
It is proper to assume that this is the case when the two
major principles of classification are quite independent of
one another. But if these principles are correlated, the re-
sidual variability reflects the interaction of the two prin-
ciples of classification, that is, the differential behavior of given
classes of goods under the influence of forces related to the
other principle of classification. Thus it may be that the
difference between raw perishable and manufactured perish-
able goods is not the same as the difference between raw
durable and manufactured durable goods. The process of
fabrication applied to perishable goods may produce results
(in the form of price behavior) different from those produced
when the process of fabrication is applied to durable goods.
Perishable and durable goods may respond differently, as
regards their price behavior, to the influence of fabrication.
Such differential behavior of categories of goods under the
influence of the same treatment (e.g., fabrication) is meas-
ured by the interaction.

If there is no such differential behavior, in a given ex- 
periment, the residual variability between cells will be of 
the same order of magnitude as the variability within cells, 
when account is taken of number of degrees of freedom. A 
test is applied on page 690. 

If we judge this result with reference to the 1 per cent 
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Nature of Degrees of Sum of Variaruse - 

variability freedom squares a* * 

Interaction (residual vari- 
ability between cells) 1 1,922.64 1,922.54 7 561429 

Within cells 666 262,059.28 393 48 5.975035 

Diff. = 1.586394 
s= .79 

value of 2 (.96), we would conclude that the residual vari- 
ability between cells is attributable to the play of chance 
rather than to any true interaction. For although the resid- 
ual variability is greater than the variance within cells 
which we use as yardstick, the excess is not clearly too great 
to be attributed to chance. Reference to the 5 per cent
value of z (.675, for n1 = 1, n2 = 666) throws more light
on the situation. Less frequently than 5 times out of 100
would the play of chance alone give us a measure of residual
variability as great as that here obtained. For the z
of .79 is greater than the 5 per cent value, .675. In such a
case as this, where P falls between .01 and .05, the evidence 
is not conclusive. There is, however, a strong indication 
that perishable and durable goods respond differently, in 
their price behavior, to the process of fabrication. Reference 
to Table I will show that among both perishable and durable
goods fabrication appears to have reduced susceptibility
to price decline under the force of business recession. M2
is distinctly greater than M1, and M4 is greater than M3.
But the influence of fabrication was apparently greater
among perishable than among durable goods.* Our test
shows that the degree of difference between the two reduc- 
tions (i.e., reductions in degree of price decline) is almost 
too great to be attributed to chance. The evidence of differ- 
ential behavior is strong enough to justify further investi- 
gation. 

* The statistical evidence does not, of course, yield information as to the 
nature of the causal relations involved. The test here applied, if positive,
reveals the presence of interaction, but does not show how the forces involved
interact to bring about the observed differential behavior. The text is to be 
read with this qualification in mind. 
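
The test just described can be retraced numerically. The following is a minimal sketch, assuming only the two variances shown in the table above and the percentage points of z quoted in the text (.675 and .96); the function name `z_from_variances` is introduced here for illustration and is not part of the original text.

```python
import math


def z_from_variances(larger_variance, smaller_variance):
    """Half the difference between the natural logarithms of two variances."""
    return 0.5 * (math.log(larger_variance) - math.log(smaller_variance))


# Variances taken from the table above.
interaction_variance = 1922.64   # residual variability between cells, 1 degree of freedom
within_cells_variance = 393.48   # variability within cells, 666 degrees of freedom

z = z_from_variances(interaction_variance, within_cells_variance)

# Percentage points of z quoted in the text for n1 = 1, n2 = 666.
five_per_cent_point = 0.675
one_per_cent_point = 0.96

print(f"z = {z:.2f}")
print("exceeds 5 per cent point:", z > five_per_cent_point)
print("exceeds 1 per cent point:", z > one_per_cent_point)
```

The computed z lies between the two percentage points, which is the situation the text describes as suggestive but not conclusive.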
GLOSSARY OF SYMBOLS

The following are the more important symbols employed 
in the preceding pages. Those of which limited use is made, 
for special purposes, are not here included. A given symbol 
is sometimes called upon to serve different purposes, but the 
precise meaning should be clear from the context. 

1. General symbols for variables and constants:

x: a variable quantity. 

y: a variable quantity. 

In general, any letter near the end of the alphabet may 
be employed to represent a variable quantity. Different 
variable quantities may be represented by the use of a 
single symbol, with different subscripts, as X1, X2, X3,
or W1, W2, W3. [A distinction is later drawn (cf. Symbols
employed in the measurement of relationship)
between capital letters and small letters, as used to 
represent variable quantities.] 

a: a constant (i.e., a quantity the value of which does not 
change in the given discussion). In general, any letter 
near the beginning of the alphabet may be used to 
represent a constant. 

2. Symbols employed in the analysis and description of 

the frequency distribution: 

m: the value of an individual observation; the value of the 
mid-point of a class. (The symbols a1, a2, a3 are sometimes
employed to represent different observations in
a series.)

f: the number of observations in a given class; the frequency
of a given class.

i: the class-interval.

l: the lower limit of a class.

N: the total number of cases in a given series or frequency
distribution. 
d: the deviation of a given observation from an average;
usually, a deviation from the arithmetic mean. When
written with a subscript, as dx or dy, it refers to a deviation
from the arithmetic mean of the variable represented
by the subscript. The symbol d is sometimes
used to designate the difference between mean and
mode.

d': the deviation of a given observation from an arbitrary 
origin, or assumed mean. 

c: the difference between an arbitrary origin, or assumed
mean, and the true mean (in terms of the symbols explained
below, c = M - M').

Σ (Sigma): the symbol for the process of summation. Thus
Σd means the sum of all the deviations.

w1, w2, w3: weights attached to a series of measures being
averaged. (Not to be confused with similar symbols
used to represent different variable quantities.)

y0: the maximum ordinate of a frequency curve.

Symbols for averages, quartiles, etc.:

M: the arithmetic mean. 

Md.: the median. 

Mo.: the mode. 

Mg: the geometric mean.

H: the harmonic mean.

M': the value of an assumed arithmetic mean.

Q1: the first or lower quartile.

Q2: the second quartile or median.

Q3: the third or upper quartile.

K: the value of a point midway between the first and third
quartiles.

D3: the third decile.

Symbols for measures of variation and skewness:

M.D.: the mean deviation. 

σ: the standard deviation; the root-mean-square deviation
about the arithmetic mean. 

</: the standard deviation of proportions, or relative fre- 
quencies. 

««: the root-mean-square deviation about an origin other 
than the arithmetic mean. 

P.E.: the probable error. 
Q.D.: the quartile deviation.

q1: the difference between the median and the lower quartile
(Md. - Q1).

q2: the difference between the upper quartile and the median
(Q3 - Md.).


V: the coefficient of variation.
sk: a measure of skewness.

χ (Chi): a measure of skewness based upon the criteria β1
and β2.


Symbols for moments and criteria of curve type. 


ν1, ν2, ν3, etc.: moments of a frequency distribution about an
arbitrary origin.

π1, π2, π3, etc.: uncorrected moments of a frequency distribution
about the arithmetic mean.

μ1, μ2, μ3, etc.: moments of a frequency distribution about the
arithmetic mean after the application of Sheppard's
corrections.



κ2: a criterion of curve type based on β1 and β2.


3. Symbols relating to index numbers.

p0': price of a given commodity at time “0” (the base period).
q0': quantity of same commodity at time “0”.
p1': price of same commodity at time “1”.
q1': quantity of same commodity at time “1”.
p0'': price of a second commodity at time “0”.
q0'': quantity of second commodity at time “0”.
p1'': price of second commodity at time “1”.
q1'': quantity of second commodity at time “1”.

p1'/p0': a price relative (relation of price of a given commodity
at time “1” to price of same commodity at time “0”).

q1'/q0': a quantity relative.

P0: price level at time “0”.

P1: price level at time “1”.
4. Symbols employed in the measurement of relationship.
X: an observed value of a variable quantity.

Y: an observed value of a variable quantity. (The observed
values of different variables may be represented also by
the symbols X1, X2, X3, or W1, W2, W3.)
X̄: the arithmetic mean of a number of observed values of
the variable X. A similar symbol may be employed for
other variables. (In one demonstration in the preceding
pages, relating to multiple correlation, the symbols A1,
A2, A3 . . . are used to represent the arithmetic means
of the variables X1, X2, X3 . . . . The symbols Mx
and My are occasionally employed to designate the
arithmetic means of different variables.)
x: value of a variable quantity expressed as a deviation
from the arithmetic mean of all the observed values.
The symbol y and the symbols x1, x2, x3 . . . are
similarly employed with respect to variables represented,
as to original observations, by the symbols
Y, X1, X2, X3 . . . .

X': a value of a variable quantity expressed as a deviation,
in class-interval units, from an arbitrary origin. The
symbol Y' has a similar meaning.

X'': a value of a variable quantity expressed as a deviation,
in original units, from an arbitrary origin. The symbol
Y'' has a similar meaning.

Yc: the computed or estimated value of a variable, as determined
from an equation of average relationship;
the symbol yc may be employed for such a computed
value, expressed as a deviation from the mean.
p: the mean product of two variables when expressed as
deviations from their respective arithmetic means, i.e.,

p = Σxy/N. When written with subscripts, as p12, the

latter relate to the variables in question, as X1, X2.
p': the mean product of two variables when expressed as
deviations from assumed arithmetic means.
r: the Pearsonian coefficient of correlation. When written
with subscripts, the latter indicate the variables to
which the coefficient relates. Thus ryx refers to the
variables y and x, and r12 refers to the variables x1
and x2.

ρ (rho): a general index of correlation. Subscripts should be
employed to indicate the variables to which the measure
relates, as ρyx, ρxy, ρlog y·x, ρlog y·log x, ρ1/y·x, etc.

(In each case the first subscript relates to the dependent
variable.)

ρ̄: a corrected index of correlation.

d: the deviation of a given observation from a fitted curve; 
the difference between an observed and a corresponding 
computed value of a variable. 

v: a residual; identical in meaning with d, as given above. 

S: the root-mean-square deviation about a fitted curve;
the standard error of estimate. This measure should
be written with a subscript to indicate the variable to
which it applies, as Sy, Slog y (the standard error of
estimate in terms of logarithms), Sr (the standard error
of estimate in terms of ratios), S1/y (the standard error
of estimate in terms of reciprocals).

S̄: a corrected standard error of estimate.

Pr.* the coefficient of rank correlation. 

η (eta): the correlation ratio. Subscripts should be employed
to represent the variables to which the measure relates,
as ηyx or ηxy. The first subscript in each case
relates to the dependent variable.

η̄: a corrected correlation ratio.

σay: the root-mean-square deviation about a line through the
means of the columns of a correlation table; the standard
deviation of the y-arrays about their respective
means. The symbol σax has the same meaning with
respect to the rows of a correlation table, or the
x-arrays.

σmy: the standard deviation of the means of the columns of
a correlation table about the mean of all the y's, the
mean of each column being weighted by the number of
items in that column. The symbol σmx has the same
meaning with respect to the means of the rows.
ζ (zeta): the test for linearity of regression (ζ = η² - r²).

m: the number of arrays employed in the computation of a
given correlation ratio; also, the number of constants
in the equation defining a curvilinear or multiple
regression.

b: the coefficient of regression; the slope of a line of regression.
When written with subscripts, the latter relate
to the variables in question, as byx, b12 (for the variables
x1, x2). The first subscript relates to the dependent
variable in each case; byx is the coefficient of regression
of y on x and bxy is the coefficient of regression of x on y.
z: a logarithmic transformation of the coefficient of correlation.
z = ½{log_e (1 + r) - log_e (1 - r)}.

R1.234: the coefficient of multiple correlation between a dependent
variable, x1, and a combination of independent
variables, x2, x3, and x4. The order may be changed,
but the primary subscript always relates to the dependent
variable.

R̄1.234: a corrected coefficient of multiple correlation.
r12.34: the coefficient of partial or net correlation between the
variables x1 and x2, when the variables x3 and x4 are
held constant. The order of subscripts is changed for a
different combination of variables, the two primary
subscripts always relating to the variables between
which the net correlation is being measured.
b12.34: the coefficient of net regression between the variables
x1 and x2, the former being dependent, when the variables
x3 and x4 are also taken account of in the estimating
equation; the weight given to x2 in estimating x1
when the estimate is also based upon values of x3 and
x4. The order of subscripts is changed for a different
combination of variables.

S1.234: the root-mean-square deviation about a line describing
the relationship between a dependent variable, x1, and
a series of independent variables, x2, x3, and x4; the
standard error of estimate of x1 under these conditions.
σ1.234: the standard deviation of the fourth order; identical

with S1.234.

β12.34: a coefficient of partial regression in an equation relating
to variables expressed in standard deviation units.

(In the seven measures immediately above, the number of subscripts corresponds
to the number of variables included in a given study. For the sake of sim-
plicity, only four variables have been assumed.) 
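
The logarithmic transformation of the coefficient of correlation defined above under z can be illustrated briefly. The following is a minimal sketch; the function names `r_to_z` and `z_to_r` are introduced here for illustration, and the inverse relation r = tanh z is a standard identity rather than something stated in the glossary.

```python
import math


def r_to_z(r):
    """z = 1/2 {log_e(1 + r) - log_e(1 - r)}, defined for -1 < r < 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * (math.log(1.0 + r) - math.log(1.0 - r))


def z_to_r(z):
    """Inverse of the transformation: r is the hyperbolic tangent of z."""
    return math.tanh(z)


for r in (0.10, 0.50, 0.90):
    z = r_to_z(r)
    print(f"r = {r:.2f}  z = {z:.4f}  back to r = {z_to_r(z):.4f}")
```

For small r the two quantities are nearly equal; as r approaches 1 the value of z grows without limit, which is why the transformed scale is used in comparing coefficients.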

5. Symbols employed in the measurement of errors. 

σ: the standard deviation of a parent population.
σM or σx̄: the standard error of a mean, derived from a knowledge
of the σ of the population.
s: the standard deviation of a sample.
sM or sx̄: the estimated standard error of a mean, in the derivation
of which s is used as an approximation to σ.

T: the deviation of a given statistical measurement from 
the mean of a normal distribution, expressed in units 
of the standard deviation of that distribution; a normal 
deviate. 

t: the deviation of a given statistical measurement from a 
hypothetical value, expressed in units of the estimated 
standard error of the measurement in question. 

σM': the standard error of the mean of a stratified sample.

D: a difference between two means. 

σD: the standard error of the difference between two means.

Dp: a difference between two percentages.

σDp: the standard error of the difference between two percentages.

Dz: the difference between two logarithmic transformations
of the coefficient of correlation.
σDz: the standard error of Dz.

σ, with any subscript, is used to represent the standard
error of the measure to which the subscript relates.
P.E. with any subscript is used to represent the probable
error of the measure to which the subscript relates
(P.E. = .67449σ).

σb1-b2: the standard error of the difference between two coefficients
of regression.
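
The factor .67449 in the relation P.E. = .67449σ is the deviation, in standard-deviation units, within which one half of the cases of a normal distribution fall. The following is a minimal sketch showing where the constant comes from; the function name `probable_error` and the sample standard error of 2.4 are hypothetical, introduced only for illustration.

```python
from statistics import NormalDist

# Half of a normal distribution lies within +/- this many standard
# deviations of the mean, i.e. the deviate at the upper quartile.
PROBABLE_ERROR_FACTOR = NormalDist().inv_cdf(0.75)


def probable_error(standard_error):
    """Convert a standard error into the corresponding probable error."""
    return PROBABLE_ERROR_FACTOR * standard_error


print(f"factor = {PROBABLE_ERROR_FACTOR:.5f}")
# Hypothetical standard error of 2.4 units.
print(f"P.E. for a standard error of 2.4 = {probable_error(2.4):.3f}")
```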

6. Symbols employed in the analysis of variance. 

z: the difference between the natural logarithms of two
standard deviations.
σz: the standard error of z.

n1: the number of degrees of freedom in the larger of two
variances being compared.

n2: the number of degrees of freedom in the smaller of two
variances being compared.

7. Other symbols. 

p: the probability of a successful outcome of a given
event. 

q: the probability of an unsuccessful outcome of a given 
event. 

n: the number of independent events in a given trial. 
χ²: a quantity used in testing hypotheses involving the
computation of theoretical frequencies; χ² defines the
relative magnitude of the differences between observed 
and theoretical frequencies. 
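
As an illustration of the last entry, the following is a minimal sketch of the usual computation of χ², assuming the customary form: the sum, over all classes, of the squared difference between observed and theoretical frequency divided by the theoretical frequency. The function name `chi_square` and the frequencies used are hypothetical and are not drawn from the text.

```python
def chi_square(observed, theoretical):
    """Sum of (observed - theoretical)^2 / theoretical over all classes."""
    if len(observed) != len(theoretical):
        raise ValueError("the two frequency series must be of equal length")
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))


# Hypothetical frequencies for three classes.
observed_frequencies = [18, 22, 20]
theoretical_frequencies = [20, 20, 20]

value = chi_square(observed_frequencies, theoretical_frequencies)
print(f"chi-square = {value:.3f}")
```

The value so obtained is referred to a table of χ² with the appropriate number of degrees of freedom.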
Appendix Table I


Areas of the Normal Curve of Error in Terms of Abscissa


x/σ    .00     .01     .02     .03     .04     .05     .06     .07     .08     .09

0.0  .00000  .00399  .00798  .01197  .01595  .01994  .02392  .02790  .03188  .03586
0.1  .03983  .04380  .04776  .05172  .05567  .05962  .06356  .06749  .07142  .07535
0.2  .07926  .08317  .08706  .09095  .09483  .09871  .10257  .10642  .11026  .11409
0.3  .11791  .12172  .12552  .12930  .13307  .13683  .14058  .14431  .14803  .15173
0.4  .15542  .15910  .16276  .16640  .17003  .17364  .17724  .18082  .18439  .18793
0.5  .19146  .19497  .19847  .20194  .20540  .20884  .21226  .21566  .21904  .22240
0.6  .22575  .22907  .23237  .23565  .23891  .24215  .24537  .24857  .25175  .25490
0.7  .25804  .26115  .26424  .26730  .27035  .27337  .27637  .27935  .28230  .28524
0.8  .28814  .29103  .29389  .29673  .29955  .30234  .30511  .30785  .31057  .31327
0.9  .31594  .31859  .32121  .32381  .32639  .32894  .33147  .33398  .33646  .33891
1.0  .34134  .34375  .34614  .34850  .35083  .35314  .35543  .35769  .35993  .36214
1.1  .36433  .36650  .36864  .37076  .37286  .37493  .37698  .37900  .38100  .38298
1.2  .38493  .38686  .38877  .39065  .39251  .39435  .39617  .39796  .39973  .40147
1.3  .40320  .40490  .40658  .40824  .40988  .41149  .41309  .41466  .41621  .41774
1.4  .41924  .42073  .42220  .42364  .42507  .42647  .42786  .42922  .43056  .43189
1.5  .43319  .43448  .43574  .43699  .43822  .43943  .44062  .44179  .44295  .44408
1.6  .44520  .44630  .44738  .44845  .44950  .45053  .45154  .45254  .45352  .45449
1.7  .45543  .45637  .45728  .45818  .45907  .45994  .46080  .46164  .46246  .46327
1.8  .46407  .46485  .46562  .46638  .46712  .46784  .46856  .46926  .46995  .47062
1.9  .47128  .47193  .47257  .47320  .47381  .47441  .47500  .47558  .47615  .47670
2.0  .47725  .47778  .47831  .47882  .47932  .47982  .48030  .48077  .48124  .48169
2.1  .48214  .48257  .48300  .48341  .48382  .48422  .48461  .48500  .48537  .48574
2.2  .48610  .48645  .48679  .48713  .48745  .48778  .48809  .48840  .48870  .48899
2.3  .48928  .48956  .48983  .49010  .49036  .49061  .49086  .49111  .49134  .49158
2.4  .49180  .49202  .49224  .49245  .49266  .49286  .49305  .49324  .49343  .49361
2.5  .49379  .49396  .49413  .49430  .49446  .49461  .49477  .49492  .49506  .49520
2.6  .49534  .49547  .49560  .49573  .49585  .49598  .49609  .49621  .49632  .49643
2.7  .49653  .49664  .49674  .49683  .49693  .49702  .49711  .49720  .49728  .49736
2.8  .49744  .49752  .49760  .49767  .49774  .49781  .49788  .49795  .49801  .49807
2.9  .49813  .49819  .49825  .49831  .49836  .49841  .49846  .49851  .49856  .49861
3.0  .49865  .49869  .49874  .49878  .49882  .49886  .49889  .49893  .49897  .49900
3.1  .49903  .49906  .49910  .49913  .49916  .49918  .49921  .49924  .49926  .49929
3.2  .49931  .49934  .49936  .49938  .49940  .49942  .49944  .49946  .49948  .49950
3.3  .49952  .49953  .49955  .49957  .49958  .49960  .49961  .49962  .49964  .49965
3.4  .49966  .49968  .49969  .49970  .49971  .49972  .49973  .49974  .49975  .49976
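
Entries of a table of normal-curve areas, measured from the mean to a given abscissa x/σ, can be checked or extended by direct computation. The following is a minimal sketch using the Python standard library; the function name `area_from_mean` is introduced here for illustration.

```python
from statistics import NormalDist


def area_from_mean(x_over_sigma):
    """Area under the unit normal curve between the mean and x/sigma."""
    return NormalDist().cdf(x_over_sigma) - 0.5


for deviate in (0.50, 1.00, 1.96, 3.00):
    print(f"{deviate:.2f}  {area_from_mean(deviate):.5f}")
```

Doubling the area gives the proportion of cases falling within the stated distance on either side of the mean.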



Appendix Table II *


Table of t


 n    P = .05      .02        .01

 1    12.706     31.821     63.657
 2     4.303      6.965      9.925
 3     3.182      4.541      5.841
 4     2.776      3.747      4.604
 5     2.571      3.365      4.032
 6     2.447      3.143      3.707
 7     2.365      2.998      3.499
 8     2.306      2.896      3.355
 9     2.262      2.821      3.250
10     2.228      2.764      3.169
11     2.201      2.718      3.106
12     2.179      2.681      3.055
13     2.160      2.650      3.012
14     2.145      2.624      2.977
15     2.131      2.602      2.947
16     2.120      2.583      2.921
17     2.110      2.567      2.898
18     2.101      2.552      2.878
19     2.093      2.539      2.861
20     2.086      2.528      2.845
21     2.080      2.518      2.831
22     2.074      2.508      2.819
23     2.069      2.500      2.807
24     2.064      2.492      2.797
25     2.060      2.485      2.787
26     2.056      2.479      2.779
27     2.052      2.473      2.771
28     2.048      2.467      2.763
29     2.045      2.462      2.756
30     2.042      2.457      2.750
 ∞     1.95996    2.32634    2.57582


* Excerpts from Table IV, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher
and his publishers, Oliver and Boyd, of Edinburgh.
Appendix Table III *


Values of the Correlation Coefficient for Different Levels of
Significance


  n    P = .05       .02         .01

  1    .996917    .9995066    .9998766
  2    .95000     .98000      .990000
  3    .8783      .93433      .95873
  4    .8114      .8822       .91720
  5    .7545      .8329       .8745
  6    .7067      .7887       .8343
  7    .6664      .7498       .7977
  8    .6319      .7155       .7646
  9    .6021      .6851       .7348
 10    .5760      .6581       .7079
 11    .5529      .6339       .6835
 12    .5324      .6120       .6614
 13    .5139      .5923       .6411
 14    .4973      .5742       .6226
 15    .4821      .5577       .6055
 16    .4683      .5425       .5897
 17    .4555      .5285       .5751
 18    .4438      .5155       .5614
 19    .4329      .5034       .5487
 20    .4227      .4921       .5368
 25    .3809      .4451       .4869
 30    .3494      .4093       .4487
 35    .3246      .3810       .4182
 40    .3044      .3578       .3932
 45    .2875      .3384       .3721
 50    .2732      .3218       .3541
 60    .2500      .2948       .3248
 70    .2319      .2737       .3017
 80    .2172      .2565       .2830
 90    .2050      .2422       .2673
100    .1946      .2301       .2540


For a total correlation, n is 2 less than the number of pairs in the
sample; for a partial correlation, the number of eliminated variates also
should be subtracted.


* Excerpts from Table V-A, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher
and his publishers, Oliver and Boyd, of Edinburgh.
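
Critical values of the correlation coefficient of this kind are tied to the corresponding values of t by the standard relation r = t/√(t² + n), where n is the number of degrees of freedom. The following is a minimal sketch of that relation, offered as an illustration rather than as the method by which the table was compiled; the function name `critical_r` is introduced here for illustration, and the t values passed in are the customary 5 and 1 per cent points for n = 10 and n = 20.

```python
import math


def critical_r(t_value, n):
    """Value of r that corresponds to a given t with n degrees of freedom."""
    return t_value / math.sqrt(t_value ** 2 + n)


# (degrees of freedom, level of significance, t at that level)
cases = [
    (10, ".05", 2.228),
    (10, ".01", 3.169),
    (20, ".05", 2.086),
]

for n, level, t_value in cases:
    print(f"n = {n:2d}  P = {level}  r = {critical_r(t_value, n):.4f}")
```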
Appendix Table IV


Showing the Relations between r and z for Values of z from 0 to 5 *


  z    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

 .0  .0000  .0100  .0200  .0300  .0400  .0500  .0599  .0699  .0798  .0898
 .1  .0997  .1096  .1194  .1293  .1391  .1489  .1587  .1684  .1781  .1878
 .2  .1974  .2070  .2165  .2260  .2355  .2449  .2543  .2636  .2729  .2821
 .3  .2913  .3004  .3096  .3185  .3275  .3364  .3452  .3540  .3627  .3714
 .4  .3800  .3885  .3969  .4053  .4136  .4219  .4301  .4382  .4462  .4542
 .5  .4621  .4700  .4777  .4854  .4930  .5005  .5080  .5154  .5227  .5299
 .6  .5370  .5441  .5511  .5581  .5649  .5717  .5784  .5850  .5916  .5980
 .7  .6044  .6107  .6169  .6231  .6291  .6352  .6411  .6469  .6527  .6584
 .8  .6640  .6696  .6751  .6805  .6858  .6911  .6963  .7014  .7064  .7114
 .9  .7163  .7211  .7259  .7306  .7352  .7398  .7443  .7487  .7531  .7574
1.0  .7616  .7658  .7699  .7739  .7779  .7818  .7857  .7895  .7932  .7969
1.1  .8005  .8041  .8076  .8110  .8144  .8178  .8210  .8243  .8276  .8306
1.2  .8337  .8367  .8397  .8426  .8455  .8483  .8511  .8538  .8565  .8591
1.3  .8617  .8643  .8668  .8693  .8717  .8741  .8764  .8787  .8810  .8832
1.4  .8854  .8875  .8896  .8917  .8937  .8957  .8977  .8996  .9015  .9033
1.5  .9052  .9069  .9087  .9104  .9121  .9138  .9154  .9170  .9186  .9202
1.6  .9217  .9232  .9246  .9261  .9276  .9289  .9302  .9316  .9329  .9342
1.7  .9354  .9367  .9379  .9391  .9402  .9414  .9425  .9436  .9447  .9458
1.8  .9468  .9478  .9488  .9498  .9508  .9518  .9527  .9536  .9546  .9554
1.9  .9562  .9571  .9579  .9587  .9595  .9603  .9611  .9619  .9626  .9633
2.0  .9640  .9647  .9654  .9661  .9668  .9674  .9680  .9687  .9693  .9699
2.1  .9706  .9710  .9716  .9722  .9727  .9732  .9738  .9743  .9748  .9753
2.2  .9757  .9762  .9767  .9771  .9776  .9780  .9786  .9789  .9793  .9797
2.3  .9801  .9805  .9809  .9812  .9816  .9820  .9823  .9827  .9830  .9834
2.4  .9837  .9840  .9843  .9846  .9849  .9852  .9855  .9858  .9861  .9863
2.5  .9866  .9869  .9871  .9874  .9876  .9879  .9881  .9884  .9886  .9888
2.6  .9890  .9892  .9895  .9897  .9899  .9901  .9903  .9905  .9906  .9908
2.7  .9910  .9912  .9914  .9915  .9917  .9919  .9920  .9922  .9923  .9925
2.8  .9926  .9928  .9929  .9931  .9932  .9933  .9935  .9936  .9937  .9938
2.9  .9940  .9941  .9942  .9943  .9944  .9945  .9946  .9947  .9949  .9950

3.0  .9951
4.0  .9993
5.0  .9999

* The figures in the body of the table are values of r corresponding to
z-values read from the scales on the left and top of the table.



Appendix Table V *


Table of χ²


 n   P = .99     .95       .50       .10       .05       .02       .01

 1   .000157   .00393     .455     2.706     3.841     5.412     6.635
 2   .0201     .103      1.386     4.605     5.991     7.824     9.210
 3   .115      .352      2.366     6.251     7.815     9.837    11.341
 4   .297      .711      3.357     7.779     9.488    11.668    13.277
 5   .554     1.145      4.351     9.236    11.070    13.388    15.086
 6   .872     1.635      5.348    10.645    12.592    15.033    16.812
 7  1.239     2.167      6.346    12.017    14.067    16.622    18.475
 8  1.646     2.733      7.344    13.362    15.507    18.168    20.090
 9  2.088     3.325      8.343    14.684    16.919    19.679    21.666
10  2.558     3.940      9.342    15.987    18.307    21.161    23.209
11  3.053     4.575     10.341    17.275    19.675    22.618    24.725
12  3.571     5.226     11.340    18.549    21.026    24.054    26.217
13  4.107     5.892     12.340    19.812    22.362    25.472    27.688
14  4.660     6.571     13.339    21.064    23.685    26.873    29.141
15  5.229     7.261     14.339    22.307    24.996    28.259    30.578
16  5.812     7.962     15.338    23.542    26.296    29.633    32.000
17  6.408     8.672     16.338    24.769    27.587    30.995    33.409
18  7.015     9.390     17.338    25.989    28.869    32.346    34.805
19  7.633    10.117     18.338    27.204    30.144    33.687    36.191
20  8.260    10.851     19.337    28.412    31.410    35.020    37.566
21  8.897    11.591     20.337    29.615    32.671    36.343    38.932
22  9.542    12.338     21.337    30.813    33.924    37.659    40.289
23 10.196    13.091     22.337    32.007    35.172    38.968    41.638
24 10.856    13.848     23.337    33.196    36.415    40.270    42.980
25 11.524    14.611     24.337    34.382    37.652    41.566    44.314
26 12.198    15.379     25.336    35.563    38.885    42.856    45.642
27 12.879    16.151     26.336    36.741    40.113    44.140    46.963
28 13.565    16.928     27.336    37.916    41.337    45.419    48.278
29 14.256    17.708     28.336    39.087    42.557    46.693    49.588
30 14.953    18.493     29.336    40.256    43.773    47.962    50.892


For larger values of n, the expression √(2χ²) - √(2n - 1) may be used
as a normal deviate with unit standard error.


* Excerpts from Table III, R. A. Fisher, Statistical Methods for Research
Workers. These excerpts are printed here through the courtesy of Dr. Fisher
and his publishers, Oliver and Boyd, of Edinburgh.
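
For values of n beyond the range of a table of χ², the usual approximation treats √(2χ²) - √(2n - 1) as a normal deviate with unit standard error. The following is a minimal sketch of that approximation; the function names are introduced here for illustration, and the check at n = 30 simply compares the approximation with the tabled 5 per cent value, 43.773.

```python
import math
from statistics import NormalDist


def chi_square_normal_deviate(chi_square, n):
    """Approximate normal deviate for a chi-square with n degrees of freedom."""
    return math.sqrt(2.0 * chi_square) - math.sqrt(2.0 * n - 1.0)


def approximate_p(chi_square, n):
    """Approximate probability of a chi-square as great as that observed."""
    deviate = chi_square_normal_deviate(chi_square, n)
    return 1.0 - NormalDist().cdf(deviate)


deviate = chi_square_normal_deviate(43.773, 30)
print(f"deviate = {deviate:.4f}")
print(f"approximate P = {approximate_p(43.773, 30):.3f}")
```

At n = 30 the approximate probability is already close to the exact value of .05, and the agreement improves as n increases.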
Appendix Table VI ¹

1 Per Cent Points of the Distribution of z



Values of n1


1 

2 

3 

4 

5. 

6 

8. 

12 

24 

00 


1 

4 

1635 

4 2585 

4 2974 

4 3175 

4 3297 

4 3379 

4 3482 

4 3585 

4 3689 

4 3794 


2 

2 

2950 

2 2976 

2 2084 

2 2088 

2 2991 

2 2992 

2 2994 

2 2997 

2 2999 

2 3001 


3 

1 

7649 

1 7140 

1 6915 

1 6786 

1 6703 

1 6645 

1 6569 

1 6489 

1 6404 

1 6314 


4 

1 

5270 

1 4452 

1 4075 

1 3856 

1 3711 

1 3609 

1 3473 

1 3327 

1 3170 

1 3000 


6 

1 

3943 

1 2929 

1 2449 

1 2164 

1 1974 

1 1838 

1 1644 

1 1457 

1 1239 

1 0907 


6 

1 

3103 

1 1955 

1 1401 

1 1068 

1 0843 

1 0680 

1 0460 

1 0218 

9948 

9643 


7 

1 

2526 

1 1281 

1 0672 

1 0300 

1 0048 

9864 

9614 

.9335 

0020 

8658 


8 

1 

2106 

1 0787 

1 0136 

9734 

0459 

9259 

8983 

.8673 

.8310 

7904 


9 

1 

1786 

1 0411 

9724 

9200 

9006 

8791 

.8494 

8157 

.7760 

7305 


10 

1 

1535 

1 0114 

9399 

8954 

8646 

8419 

.8104 

7744 

7324 

.6816 


11 

1.1338 

9874 

9136 

.8674 

8354 

8116 

7785 

7405 

6958 

6408 


12 

1 

.1166 

9677 

8010 

8443 

8111 

7864 

.7520 

7122 

6640 

6061 


13 

1 

1027 

9511 

8787 

8248 

7907 

7652 

7295 

6882 

6386 

.6761 


14 

1 

0909 

9370 

8581 

.8082 

7732 

7471 

7103 

.6675 

6150 

5500 

■1 

10 

1 

0807 

9249 

8448 

7939 

7582 

7314 

6937 

6496 

5961 

.5269 


16 

1 

0719 

9144 

8331 

7814 

7450 

7177 

6791 

.6330 

5786 

.5064 

17 

1 

0641 

0061 

8220 

7706 

7335 

7067 

6663 

6190 

.5630 

.4879 

1 

18 

1 

0572 

8970 

.8138 

7607 

7232 

6960 

.6549 

.6075 

5516 

.4712 


19 

1 

0611 

8897 

8067 

7621 

7140 

6854 

6447 

.5964 

.5366 

.4560 

§ 

20 

1 

0457 

8831 

7085 

7443 

7058 

6768 

6355 

5864 

5253 

.4421 


21 

1 

0408 

8772 

,7020 

7372 

.6984 

6690 

.6272 

6773 

.5150 

4294 


22 

1 

0363 

8719 

7800 

7309 

6916 

6620 

.6190 

.5601 

.5050 

4176 


23 

1 

0322 

8670 

7806 

7251 

6855 

6555 

6127 

1 5615 

1 4969 

.4068 


24 

1 

0285 

8626 

7767 

7197 

6799 

6496 

.0064 

1 .5545 

.4890 

3967 


25 

1 

0261 

8585 

7712 

7148 

,6747 

.6442 

.6006 

1 .5481 

4816 

3872 


26 1 

1 

0220 

.8548 

7670 

7103 

6699 

.6392 

.5952 

.5422 

4748 

.3784 


27 

1 

0191 

8513 

.7631 

.7062 

1 .6655 

6346 

.5902 

5367 

.4685 

3701 


28 

1 

0164 

8481 

,7696 

.7023 

6614 

.6308 

.5856 

.5316 

4620 

3624 


29 

1 

0139 

8451 

7562 

6987 

6676 

' 6263 

.5813 

.5269 

.4570 

.3550 


30 

1 

0116 

8423 

.7531 

.6964 

.6540 

6226 

.5773 

5224 

.4510 

3481 


60 

.9784 

.8025 

.7086 

.6472 

.6028 

.5687 

.5189 

.4574 

3746 

.2352 


QO 


9462 

7686 

6661 

5999 

5522 

.5152 

.4604 

3908 

.2913 

0 


¹ From Table VI, R. A. Fisher, Statistical Methods for Research Workers.
This table is printed here through the courtesy of Dr. Fisher and his publishers,
Oliver and Boyd, of Edinburgh.
Appendix Table VII ¹

5 Per Cent Points of the Distribution of z



Values of n1


1. 

2. 

3 

4. 

5 

6 

8 

12 


00 


1 

2 5421 

2 6479 

2 6870 

2 7071 

2 7194 

2 7276 

2 7380 

2 7484 

2 7588 

2.7693 


2 

1 4502 

1 4722 

1.4768 

1 4787 

1 4800 

1 4808 

1 4810 

1 4830 

1 4840 

1 4851 


3 

1 1577 

1 1284 

1 1187 

1 1051 

1 0994 

1 0953 

1 0890 

1 0842 

1 0781 

1 0716 


4 

1 0212 

            .9600  .9420  .9272  .9168  .9093  .8993  .8885  .8767  .8630
  5  .9441  .8777  .8441  .8286  .8097  .7997  .7862  .7714  .7560  .7368
  6  .8948  .8188  .7798  .7558  .7394  .7274  .7112  .6931  .6729  .6499
  7  .8606  .7777  .7347  .7080  .6896  .6761  .6576  .6369  .6134  .5862
  8  .8365  .7475  .7014  .6725  .6525  .6378  .6175  .5945  .5682  .5371
  9  .8163  .7242  .6757  .6450  .6238  .6080  .5862  .5613  .5324  .4970
 10  .8012  .7058  .6553  .6232  .6006  .5843  .5611  .5346  .5035  .4657
 11  .7880  .6909  .6387  .6055  .5822  .5648  .5406  .5126  .4705  .4387
 12  .7788  .6786  .6250  .5907  .5666  .5487  .5234  .4941  .4502  .4156
 13  .7703  .6682  .6134  .5783  .5535  .5350  .5089  .4785  .4419  .3957
 14  .7630  .6594  .6036  .5677  .5423  .5233  .4964  .4640  .4269  .3782
 15  .7568  .6518  .5950  .5585  .5326  .5131  .4855  .4532  .4138  .3628
 16  .7514  .6451  .5876  .5505  .5241  .5042  .4760  .4428  .4022  .3490
 17  .7466  .6393  .5811  .5434  .5166  .4964  .4676  .4337  .3919  .3366
 18  .7424  .6341  .5753  .5371  .5099  .4894  .4602  .4255  .3827  .3253
 19  .7386  .6295  .5701  .5315  .5040  .4832  .4535  .4182  .3743  .3151
 20  .7352  .6254  .5654  .5265  .4986  .4776  .4474  .4110  .3668  .3057
 21  .7322  .6216  .5612  .5210  .4938  .4725  .4420  .4055  .3599  .2971
 22  .7294  .6182  .5574  .5178  .4894  .4670  .4370  .4001  .3536  .2892
 23  .7269  .6151  .5540  .5140  .4854  .4636  .4325  .3950  .3478  .2818
 24  .7246  .6123  .5508  .5106  .4817  .4598  .4283  .3904  .3425  .2749
 25  .7226  .6097  .5478  .5074  .4783  .4562  .4244  .3862  .3376  .2685
 26  .7205  .6078  .5451  .5045  .4752  .4529  .4209  .3823  .3330  .2625
 27  .7187  .6051  .5427  .5017  .4723  .4490  .4176  .3786  .3287  .2569
 28  .7171  .6030  .5403  .4992  .4696  .4471  .4146  .3752  .3248  .2516
 29  .7155  .6011  .5382  .4969  .4671  .4444  .4117  .3720  .3211  .2466
 30  .7141  .5994  .5362  .4947  .4648  .4420  .4090  .3601  .3176  .2419
 60  .6933  .5738  .5073  .4682  .4311  .4064  .3702  .3255  .2654  .1644
  ∞  .6729  .5486  .4787  .4319  .3974  .3706  .3309  .2804  .2085  0


1 From Table VI, R. A. Fisher, Statistical Methods for Research Workers.
This table is printed here through the courtesy of Dr. Fisher and his pub-
lishers, Oliver and Boyd, of Edinburgh.
Appendix Table VIII

Squares of the Natural Numbers from 100 to 999

                                                 SQUARE OF
  n         n     n + 1     n + 2     n + 3     n + 4     n + 5     n + 6     n + 7     n + 8     n + 9

100   1 00 00   1 02 01   1 04 04   1 06 09   1 08 16   1 10 25   1 12 36   1 14 49   1 16 64   1 18 81
110   1 21 00   1 23 21   1 25 44   1 27 69   1 29 96   1 32 25   1 34 56   1 36 89   1 39 24   1 41 61
120   1 44 00   1 46 41   1 48 84   1 51 29   1 53 76   1 56 25   1 58 76   1 61 29   1 63 84   1 66 41
130   1 69 00   1 71 61   1 74 24   1 76 89   1 79 56   1 82 25   1 84 96   1 87 69   1 90 44   1 93 21
140   1 96 00   1 98 81   2 01 64   2 04 49   2 07 36   2 10 25   2 13 16   2 16 09   2 19 04   2 22 01
150   2 25 00   2 28 01   2 31 04   2 34 09   2 37 16   2 40 25   2 43 36   2 46 49   2 49 64   2 52 81
160   2 56 00   2 59 21   2 62 44   2 65 69   2 68 96   2 72 25   2 75 56   2 78 89   2 82 24   2 85 61
170   2 89 00   2 92 41   2 95 84   2 99 29   3 02 76   3 06 25   3 09 76   3 13 29   3 16 84   3 20 41
180   3 24 00   3 27 61   3 31 24   3 34 89   3 38 56   3 42 25   3 45 96   3 49 69   3 53 44   3 57 21
190   3 61 00   3 64 81   3 68 64   3 72 49   3 76 36   3 80 25   3 84 16   3 88 09   3 92 04   3 96 01
200   4 00 00   4 04 01   4 08 04   4 12 09   4 16 16   4 20 25   4 24 36   4 28 49   4 32 64   4 36 81
210   4 41 00   4 45 21   4 49 44   4 53 69   4 57 96   4 62 25   4 66 56   4 70 89   4 75 24   4 79 61
220   4 84 00   4 88 41   4 92 84   4 97 29   5 01 76   5 06 25   5 10 76   5 15 29   5 19 84   5 24 41
230   5 29 00   5 33 61   5 38 24   5 42 89   5 47 56   5 52 25   5 56 96   5 61 69   5 66 44   5 71 21
240   5 76 00   5 80 81   5 85 64   5 90 49   5 95 36   6 00 25   6 05 16   6 10 09   6 15 04   6 20 01
250   6 25 00   6 30 01   6 35 04   6 40 09   6 45 16   6 50 25   6 55 36   6 60 49   6 65 64   6 70 81
260   6 76 00   6 81 21   6 86 44   6 91 69   6 96 96   7 02 25   7 07 56   7 12 89   7 18 24   7 23 61
270   7 29 00   7 34 41   7 39 84   7 45 29   7 50 76   7 56 25   7 61 76   7 67 29   7 72 84   7 78 41
280   7 84 00   7 89 61   7 95 24   8 00 89   8 06 56   8 12 25   8 17 96   8 23 69   8 29 44   8 35 21
290   8 41 00   8 46 81   8 52 64   8 58 49   8 64 36   8 70 25   8 76 16   8 82 09   8 88 04   8 94 01
300   9 00 00   9 06 01   9 12 04   9 18 09   9 24 16   9 30 25   9 36 36   9 42 49   9 48 64   9 54 81
310   9 61 00   9 67 21   9 73 44   9 79 69   9 85 96   9 92 25   9 98 56  10 04 89  10 11 24  10 17 61
320  10 24 00  10 30 41  10 36 84  10 43 29  10 49 76  10 56 25  10 62 76  10 69 29  10 75 84  10 82 41
330  10 89 00  10 95 61  11 02 24  11 08 89  11 15 56  11 22 25  11 28 96  11 35 69  11 42 44  11 49 21
340  11 56 00  11 62 81  11 69 64  11 76 49  11 83 36  11 90 25  11 97 16  12 04 09  12 11 04  12 18 01
350  12 25 00  12 32 01  12 39 04  12 46 09  12 53 16  12 60 25  12 67 36  12 74 49  12 81 64  12 88 81
360  12 96 00  13 03 21  13 10 44  13 17 69  13 24 96  13 32 25  13 39 56  13 46 89  13 54 24  13 61 61
370  13 69 00  13 76 41  13 83 84  13 91 29  13 98 76  14 06 25  14 13 76  14 21 29  14 28 84  14 36 41
380  14 44 00  14 51 61  14 59 24  14 66 89  14 74 56  14 82 25  14 89 96  14 97 69  15 05 44  15 13 21
390  15 21 00  15 28 81  15 36 64  15 44 49  15 52 36  15 60 25  15 68 16  15 76 09  15 84 04  15 92 01
400  16 00 00  16 08 01  16 16 04  16 24 09  16 32 16  16 40 25  16 48 36  16 56 49  16 64 64  16 72 81
410  16 81 00  16 89 21  16 97 44  17 05 69  17 13 96  17 22 25  17 30 56  17 38 89  17 47 24  17 55 61
420  17 64 00  17 72 41  17 80 84  17 89 29  17 97 76  18 06 25  18 14 76  18 23 29  18 31 84  18 40 41
430  18 49 00  18 57 61  18 66 24  18 74 89  18 83 56  18 92 25  19 00 96  19 09 69  19 18 44  19 27 21
440  19 36 00  19 44 81  19 53 64  19 62 49  19 71 36  19 80 25  19 89 16  19 98 09  20 07 04  20 16 01
450  20 25 00  20 34 01  20 43 04  20 52 09  20 61 16  20 70 25  20 79 36  20 88 49  20 97 64  21 06 81
460  21 16 00  21 25 21  21 34 44  21 43 69  21 52 96  21 62 25  21 71 56  21 80 89  21 90 24  21 99 61
470  22 09 00  22 18 41  22 27 84  22 37 29  22 46 76  22 56 25  22 65 76  22 75 29  22 84 84  22 94 41
480  23 04 00  23 13 61  23 23 24  23 32 89  23 42 56  23 52 25  23 61 96  23 71 69  23 81 44  23 91 21
490  24 01 00  24 10 81  24 20 64  24 30 49  24 40 36  24 50 25  24 60 16  24 70 09  24 80 04  24 90 01
500  25 00 00  25 10 01  25 20 04  25 30 09  25 40 16  25 50 25  25 60 36  25 70 49  25 80 64  25 90 81
510  26 01 00  26 11 21  26 21 44  26 31 69  26 41 96  26 52 25  26 62 56  26 72 89  26 83 24  26 93 61
520  27 04 00  27 14 41  27 24 84  27 35 29  27 45 76  27 56 25  27 66 76  27 77 29  27 87 84  27 98 41
530  28 09 00  28 19 61  28 30 24  28 40 89  28 51 56  28 62 25  28 72 96  28 83 69  28 94 44  29 05 21
540  29 16 00  29 26 81  29 37 64  29 48 49  29 59 36  29 70 25  29 81 16  29 92 09  30 03 04  30 14 01
Appendix Table IX


Sums of the First Three Powers of the Natural Numbers from 1 to 60 


 n   Σ(n)   Σ(n²)     Σ(n³)      n    Σ(n)    Σ(n²)       Σ(n³)

 1     1       1         1     26     351    6 201     123 201
 2     3       5         9     27     378    6 930     142 884
 3     6      14        36     28     406    7 714     164 836
 4    10      30       100     29     435    8 555     189 225
 5    15      55       225     30     465    9 455     216 225
 6    21      91       441     31     496   10 416     246 016
 7    28     140       784     32     528   11 440     278 784
 8    36     204     1 296     33     561   12 529     314 721
 9    45     285     2 025     34     595   13 685     354 025
10    55     385     3 025     35     630   14 910     396 900
11    66     506     4 356     36     666   16 206     443 556
12    78     650     6 084     37     703   17 575     494 209
13    91     819     8 281     38     741   19 019     549 081
14   105   1 015    11 025     39     780   20 540     608 400
15   120   1 240    14 400     40     820   22 140     672 400
16   136   1 496    18 496     41     861   23 821     741 321
17   153   1 785    23 409     42     903   25 585     815 409
18   171   2 109    29 241     43     946   27 434     894 916
19   190   2 470    36 100     44     990   29 370     980 100
20   210   2 870    44 100     45   1 035   31 395   1 071 225
21   231   3 311    53 361     46   1 081   33 511   1 168 561
22   253   3 795    64 009     47   1 128   35 720   1 272 384
23   276   4 324    76 176     48   1 176   38 024   1 382 976
24   300   4 900    90 000     49   1 225   40 425   1 500 625
25   325   5 525   105 625     50   1 275   42 925   1 625 625
Appendix Table X




      44     43     42

1    4.4    4.3    4.2
2    8.8    8.6    8.4
3   13.2   12.9   12.6
4   17.6   17.2   16.8
5   22.0   21.5   21.0
6   26.4   25.8   25.2
7   30.8   30.1   29.4
8   35.2   34.4   33.6
9   39.6   38.7   37.8

      41     40     39

1    4.1    4.0    3.9
2    8.2    8.0    7.8
3   12.3   12.0   11.7
4   16.4   16.0   15.6
5   20.5   20.0   19.5
6   24.6   24.0   23.4
7   28.7   28.0   27.3
8   32.8   32.0   31.2
9   36.9   36.0   35.1

      38     37     36

1    3.8    3.7    3.6
2    7.6    7.4    7.2
3   11.4   11.1   10.8
4   15.2   14.8   14.4
5   19.0   18.5   18.0
6   22.8   22.2   21.6
7   26.6   25.9   25.2
8   30.4   29.6   28.8
9   34.2   33.3   32.4

      35     34     33

1    3.5    3.4    3.3
2    7.0    6.8    6.6
3   10.5   10.2    9.9
4   14.0   13.6   13.2
5   17.5   17.0   16.5
6   21.0   20.4   19.8
7   24.5   23.8   23.1
8   28.0   27.2   26.4
9   31.5   30.6   29.7

      32     31     30

1    3.2    3.1    3.0
2    6.4    6.2    6.0
3    9.6    9.3    9.0
4   12.8   12.4   12.0
5   16.0   15.5   15.0
6   19.2   18.6   18.0
7   22.4   21.7   21.0
8   25.6   24.8   24.0
9   28.8   27.9   27.0
by production changes, 305; pre-
war relation to stock-price cycles,
390 ff.; post-war relation, 35M>; dura-
tion of, 467, 481; effect on price,
475; see also Cyclical variation

Census of manufactures, 309, 317, 371 
Center of gravity, 102 
Central tendency, 97, 99; measures
of, 101 ff.

Certainty, in probability theory, 426 
Chaddock, R. E., 84
Chain index, 216

Chance, law of, see Normal law of 
error and Probability 
Characteristic, logarithmic, 24 
Charlier check, 149 
Charts, construction of, 32 ff.; for 
comparison of frequencies, 41 ff.; 
representation of component parts, 
42 ff.; cumulative, 44 ff.; Gannt 
progress chart, 46; see also Graphic 
presentation 

Chi-square, 618; distribution of, 618,
626; in testing goodness of fit,
626 ff.; in testing homogeneity,
633 ff.; in testing independence of
principles of classification, 630 ff.;
table of values, 625, 703
Classification of quantitative mate- 
rial, see Organization of data 
Classification, principles of, 53 ff.; test- 
ing significance of, 494 ff.; testing 
independence of, 630 ff., 681 ff. 
Classified data, see Grouping of data 
Class interval, 53, 57 ff., 104, 347,
357; in locating the mode, 117; in 
computing the standard deviation, 
149 

Coefficient of correlation, see Cor- 
relation coefficient 

Coefficient of multiple correlation, see 
Correlation, multiple 
Coefficient of regression, see Regres- 
sion 

Coefficient of variation, 156 
Coin tossing, 91 

Column diagram, 41, 64 ff., 66, 73, 91
Commodities, included in price- 
change study, 209 

Compound interest, law of, 30 ff., 40; 
curve of, 267 

Concurrence of cycles, 390 ff. 
Constants, 12, 244 ff.

Controls, in sampling procedure, 462 
Coördinate geometry, 8 ff.

Correction, of index numbers, 311; 
of the correlation index, 412; of the 
standard error, 542 
Correction factor, in computing the 
correlation coefficient, 339; in com- 
puting the mean, 106, 351, 392; see 
also Bias 

Correlation, coefficient of, 334 ff., 520; 


calculation of, 337 ff., 353, 364, 648; 
product-moment method, 349 ff.; 
construction of table, 340 ff.; sum- 
mary of correlation procedure, 
366 ff.; limitations of, 370; relation 
to correlation ratio, 422; tests for 
the significance of, 502 ff., 611; sig- 
nificance of difference of coef- 
ficients, 616; standard error of, 474; 
derived from small samples, 610; 
weighted average of, 617, 618; table 
of significant values, 612, 701; 
table of relations to the z function, 
702 

Correlation, index of, 408 ff., 520; 
formula for, 408, 412, 647; sig- 
nificance of, 409; computation of, 
410, 412; standard error, 477

Correlation, linear, 325 ff.; lines of 
regression, 359 ff.; distortion in 
non-normal distributions, 372 ff.;
of grouped data, 340 ff.; in the
measurement of time sequence,
389 ff.; see also Correlation, coeffi- 
cient of 

Correlation, multiple, 530 ff.; prelim- 
inary analysis, 533; use of multi-
variate estimating equation, 536 ff.; 
coefficient of, 543 ff.; correction 
for number of constants involved, 
544; test of significance of the 
coefficient of, 544; standard error of 
the coefficient of, 545; application 
of method, 547; limitations of proce- 
dure, 563; simplification of normal 
equations, 652 ff.

Correlation, non-linear, 404 ff.; use of 
reciprocal relations, 582; use of 
logarithmic relations, 575 ff. 

Correlation of time series, 380 ff.; of
secular trends, 381; of deviations 
from trend, 385; dangers of pro- 
cedure, 388, 389; concurrent cycles, 
391; use of moving average in, 
398; of short term fluctuations, 
398 ff. 

Correlation, partial, 584 ff.; relation 
to simple correlation, 549; system- 
atic computation of coefficients 
of, 554 ff.; standard error of the 
coefficient of, 560 

Correlation, rank, 374 ff., 521 
Correlation ratio, 413 ff., 519; for-
mula for, 414; computation of, 415,
417 ff.; significance of, 421; cor-
rection of, 421; relation to correla- 
tion coefficient, 422 
Cost of living index, 221 
Cotton statistics, 161; correlation of 
price and production, 384, 400 
Coverage, of production index num- 
bers, 316 
Cox, G. M., 688 
Criteria of curve type, 444 
Crum, W. L., 302 

Cumulative charts, 44 ff.; arrange- 
ment of data, 77; frequency curve, 
80 

Curve fitting, by least squares, 246 ff.; 
linear, 246; parabolic, 253, 260 ff.; 
of linear business series, 257; use of
logs in, 264, 269 

Curve type, criteria of, 444; see also 
Functional relationship 
Cutts, Jesse M., 218 
Cycles, correlation of, 382, 380 
Cycles of reference, see Reference
cycles

Cyclical fluctuations, correlation of,
382 ff.; see also Cyclical variation
Cyclical variation, 230, 302, 380, 526;
removal by moving averages, 236,
284; measurement of, 293 ff.; in
industrial activity, 312 ff., 390

Davenport, Donald H., 219, 254, 437, 
573 

Davenport, K., 415 
Day, Edmund E., construction of in-
dex of physical volume, 310 
Decile, graphic location of, 114 
Deflation, in time series analysis, 279
Degrees of freedom, in variance 
analysis, 491, 496, 50*1 ff., 512, 517, 
528; in statistical induction, 604, 
611 

Degree of relationship, see Relation-
ship, measurement of
Deming, W. E., 469, 470; Chi-square
test, 629

Dennis, Samuel J., 218 
Dependent variable, see Variable 
Depreciation, 79 
Derivative, partial, 630, 643


Descartes, René, 8

Description, of frequency distribu- 
tions, 86, 137 ff., 448 ff.; methods 
of, 99 ff.; statistical, 452 ff. 

Deviate, normal, 437 
Deviation, 97; probable, 152; from 
trend, 263, 385, 396; from mean, 
347; vertical and horizontal, 363; 
from moving averages, 398; from 
means of arrays, 417; quartile, see
Quartile deviation; mean, see Mean
deviation; standard, see Standard 
deviation; root-mean-square, see 
Root-mean-square deviation 
Differences, finite, 275 
Discount rates, relation between, 
340 ff., 361 ff. 

Dispersion, 99, 115, 137, 414; zone of, 
89 ff., 349; measures of, 137 ff., 
330, 335; in correlation analysis, 
490; test of differences in disper- 
sion, 492; see also Variation and 
Scatter 

Distribution, frequency, 50 ff.; de- 
scription of, 137 ff.; general char-
acteristics of, 97 ff.; of income, 71; 
of sawmills, 84; of heights, 87; of 
astronomical errors, 89; of artillery 
shots, 90; of coin throws, 91; of 
economic data, 93 ff.; of exchange 
rates, 95; of wage earners, 96; of 
bonds, 116; of stock prices, 125; 
see also List of charts 
Doolittle solution, of normal equa- 
tions, 540, 655 

Dow-Jones index number, 393 

Edgeworth, F. Y., 204; binomial ex- 
pansion, 432 

Elderton, W. P., Chi-square table, 624
Equation of regression, see Regression
Error, normal curve of, see Normal
law of error

Error, sampling, see Standard error
Estimate, making of, 332 ff., 566 ff.;

zone of, 349, 571 ff., 590 ff. 
Exchange rates, distribution of, 94 
Expected value, 294 
Exponential curve, 19, 28, 258, 266, 
271; modified, 272, 667; see also 
Logarithmic curve 
Exponent, logarithmic, 23 
Exports, statistics of, 36, 38
Extrapolation, 264, ^7, 675 
Ezekiel, Mordecai, 412, 477, 547, 564,
652; multiple correlation analysis,
537 ff.; correction of standard error, 
542; correction of the correlation 
coefficient, 544 

Factor reversal test, 199 
Falkner, Helen D., 288 
Farm price index number, 222 ff.
Federal Reserve Board, index of pro-
duction, 316 ff. 

Fisher, Arne, 448

Fisher, Irving, 181; time reversal test, 
190; weighted index numbers, 195, 
196, 204; factor reversal test, 199; 
ideal index number, 201 
Fisher, R. A., 270, 479, 603; statistical 
population, 456; null hypothesis, 
475; analysis of variance, 490 ff.; 
z table, 499, 704-5; extension of z 
table, 518; t table, 603, 700; sig-
nificance of the correlation coeffi- 
cient, 611; Chi-square table, 625, 
628, 703 

Frazier, Edward K., 105 
Frequency curve, 41, 82; polygon, 
41, 67, 85, 88, 91, 93 
Frequency distribution, 50 ff.; pur- 
pose of, 56; comparison of, 86; gen- 
eral characteristics of, 97 ff.; see 
also Distribution 

Frequency, theoretical and actual, 
431 ff. 

Friedman, Milton, 521 
Functional relationship, 12, 389; lin- 
ear, see Linear relationship; para- 
bolic, see Parabolic relationship

Galton, Francis, lines of regression, 
359 

Gannt, H. L., progress chart, 46 
Gauss, Karl Friedrich, normal law of 
error, 435 

Geometric mean, 125 ff.; definition of, 
125; computation of, 126; charac- 
teristics of, 127, 135; as measure of 
central tendency, 129; as average 
of relative prices, 185; of logarith- 
mic observations, 584 
Geometric progression, 18, 28, 271, 
275, 669


Glover, James W., 269, 437 
Gompertz curve, 272, 671 
Goodness of fit, 447; criteria of, 276; 

Chi-square test of, 626 ff.

Graphic method, of locating aver- 
ages, 120 ff.; in multiple correla- 
tion, 564 

Graphic presentation, 8 ff.; of fre-
quency distributions, 63; of time
series, 227 

Grouping of data, 53, 112; ungrouped 
data, 109; effect on mode, 115; in 
correlation tables, 340, 354 
Growth curves, Gompertz, 272, 671; 
modified exponential, 272, 667 ff.; 
logistic, 272, 675 ff. 

Hall, Lincoln W., 288 
Harmonic equation, 579 
Harmonic mean, 132 ff.; character- 
istics of, 135; of relative prices, 
186; of reciprocal observations, 585, 
587 

Hart, Hornell, reliability of a per-
centage, 483

Height distribution, 87, 360 
High contact, of frequency distribu- 
tions, 443 

Histogram, 64 ff.; see also Column 
diagram 

Homogeneity, 487; tests for, 120,
630 ff.; in time series, 301; in sam-
pling procedure, 462, 007; Chi-
square test of, 633 ff.

Hotelling, Harold, 378, 479 
Hyperbolic curve, 16, 28, 569 

Ideal index, 201; for the measurement 
of production, 307 

Income distribution, statistics of, 71, 
97, 102 

Independence, tests of, 630 ff.
Independent variable, see Variable 
Index numbers, 18; nature of, 161 ff.; 
“ideal,” 201; use of aggregates, 
165; of retail price, 220; of cost of 
living, 221; of farm price, 222 ff.; of
seasonal variation, 287 ff.; of in- 
dustrial activity, 312 ff., 390, 393; 
of stock prices, 390 ff. 

Index numbers of production, 305 ff.;
unadjusted, 300; adjusted, 310;
Federal Reserve Board index, 316;
derived from price indices, 319 ff.;
of industrial productivity, 321
Index numbers of wholesale prices, 
167, 216 ff.; purpose of, 170; con- 
struction of, 180 ff., 208 ff.; aggre- 
gative type, 181; arithmetic aver- 
age type, 183; weighted, 196; of 
farm crop prices, 182, 189; geomet-
ric average type, 185, 198; median
type, 185; harmonic type, 186;
comparison of types, 1^; time
reversal test, 190; weighted types,
193 ff., 198; alternative types,
204 ff.; commodities to be included 
in, 209 

Index of correlation, see Correlation
index 

Index of variability, 157 
Induction, statistical, 452 ff., 508 ff.;
nature of, 453; measures of reliabil-
ity, 464 ff.; generalizing from small
samples, 598 ff.

Industrial change, measurements of,
322

Inference, statistical, see Induction
Interaction, of principle of classifica-
tion, 688

Interpolation, 70, 81, 277; for the
median, 114; for the mode, 118; for
monthly trend values, 273; in
Fisher's z table, 500; double inter-
polation, 507

Irrigation, correlated with alfalfa 
yield, 404 ff. 

Jones, D. Q, binomial distribution, 
660 

Karsten, Karl G., 278
Kelley, Truman L., 206; reliability of
constants, 485
Kendall, M. G., 629
Keynes, J. M., random sampling, 461
Killough, H. B., 569
Knibbs, Sir George, 214
Kurtosis, 100, 187, 169
Kurtz, Edwin, 77

Lag, in time series analysis, 390 ff.;
changes in different cycle phases, 
397 


Laspeyre’s index number, 193, 214 
Law of large numbers, 455 
Least squares, method of, 246 ff., 
638 ff.; applied to linear relations, 
246, 328, 354, 366, 509; applied to 
power curves, 260, 405; applied to 
logarithmic curves, 264 ff.; in cor-
relation analysis, 366, 373, 405 
Leptokurtic, 449 
Life table, 77 

Line of regression, see Regression
Linear correlation, see Correlation, 
linear 

Linearity, test for, 423; by variance 
analysis, 508 ff.; see also Linear 
relationship 

Linear relationship, 14, 16, 26, 325 ff.; 
fitting by least-squares, 246 ff.; in
business series, 257, 268; between
discount rates, 348; tests for, 423, 
477, 508 

Link relatives, 204 

Logarithmic, equation, 26 ff., 563,
569 ff., 671; mean, 128; see also
Geometric mean; paper, 131, 227;
deviation, 265; function of the cor-
relation coefficient, 614
Logarithms, common, 23 ff., 492, 672;
use in computing the geometric
mean, 125, 130; use in curve fitting,
264 ff., 269; Naperian, 435, 492;
Appendix table X, 709
Logistic curve, 272, 675

Macaulay, F. R., 185, 244 
Malenbaum, Wilfred, 565
Mantissa, 24 

Manufactured goods, rôle in price
movements, 213 

Mean, arithmetic, see Arithmetic
mean; geometric, see Geometric
mean; harmonic, see Harmonic
mean

Mean deviation, 139 ff.

Mean product, 351, 358
Measurement of, central tendency,
see Central tendency; relationship, 
see Relationship, etc. 

Median, definition of, 102; location of, 
109 ff.; computation of, 113; 
graphic location of, 120 ff.; char-
acteristics of, 134; relation to mean
deviation, 140; of relative prices,
185; standard error of, 472
Merriman, Mansfield, 90
Mesokurtic, 449 
Miner, J. R., 556

Mitchell, W. C., 93, 173, 176, 212,
242, 303; comparison of index num-
bers, 209; business cycles, 467, 483
Mode, 96; definition of, 101; location
of, 115 ff.; graphic location of,
120 ff.; characteristics of, 135
Moments, of frequency distributions, 
440; about the mean, 442 
Monthly trend values, 272 ff.
Mortality tables, 80 
Moving average, 234 ff.; application 
to non-linear series, 239; measure- 
ment of seasonal fluctuations, 
285 ff.; use in correlating cycles,
398 ff.

Mudgett, Bruce D., 216 
Multiple correlation, see Correlation, 
multiple 

Multiple frequency table, 289 

Napierian logarithm, 435 
National Bureau of Economic Re- 
search, 244, 320, 397; study of in- 
come distribution, 132; construc- 
tion of index numbers, 219; study 
of production change, 309 
National Industrial Conference Board, 
cost of living index, 221 
Natural number, 24, 28; table of 
squares of, 706; sums of powers, 708 
New York Census of Manufactures, 
309, 317, 371 

Non-linear correlation, see Correla- 
tion, non-linear 

Non-linear relationship, 404 ff.; see
also Parabolic and exponential func- 
tion 

Normal deviate, 437, 599; table of, 
603, 699

Normal equations, for linear relation- 
ship, 249; parabolic, 254; of multi- 
variate relationships, 537 ff.; deriva-
tion of, 639; formation of, 640; 
checks on, 648; Doolittle solution 
of, 654 

Normal law of error, 98, 153, 426 ff., 
435 ff.; assumptions underlying,
436; its use, 438; economic appli-
cation of, 440 ff.; criteria for, 444;
fitting the normal curve, 445 ff.;
distribution, 332, 371, 458; de-
parture from, 374, 378; computa-
tion of theoretical frequencies, 446;
generalization of results, 448; of
the distribution of means, 464; use
in measures of reliability, 464 ff.; 
area under, 437, 699; test of good- 
ness of fit of, 627 
Null hypothesis, 475 

Ogive, 80 ff., 85 

Organization of data, 51, 82, 100; in 
time series, 226 

Origin, arbitrary, 107, 351; at point 
of averages, 353, 365 
Orthogonal polynomials, 270 

Paasche’s index number, formula for,
195, 215 

Pabst, Margaret, 378, 479 
Parabolic curve, 16, 21, 27, 270, 577; 

see also Parabolic function 
Parabolic function, fitting of, 253 ff.; 
second degree, 260, 405; logarith- 
mic, 264, 269, 270; testing para- 
bolic hypothesis, 514 ff. 

Parameter, 457 

Pareto, Vilfredo, law of income dis- 
tribution, 132 

Partial correlation, see Correlation, 
partial 

Peake, E. G., 94 

Peakedness, 100; see also Kurtosis 
Pearl, Raymond, 271, 272; formation 
of normal equations, 642; logistic 
curve, 675 

Pearson, Karl, 156, 158, 254, 436; 
coefficient of correlation, 3!^5; cor-
relation ratio, 413 ff.; curve types,
448; descriptive measures of fre- 
quency distributions, 448 ff.; statis- 
tical inference, 454; Chi-square dis- 
tribution, 618 ff., 626 
Percentages, difference between and 
significance of, 483 
Percentile, 114 

Periodic fluctuation, 230; removal by 
moving averages, 235; see also
Seasonal and cyclical variation
Periodic function, 21
Persons, Warren M., 204; analysis of
cycle lags, 390 ff.

Platykurtic, 449 

Polynomial, orthogonal, 270; see also 
Parabolic function 
Population, statistical, 453, 454, 466 
Potential series, 21 

Power series, ^3; see also Parabolic 
function 

Price relative, 162; arithmetic aver-
age of, 183

Price, wholesale, 93, 168; index num- 
bers of, 161 ff., 167, 216 ff.; see also 
Index number; price ratios, 171 ff.; 
measurement of change of, 174; 
wholesale groups, 211; index of re- 
tail, 220; of farm products, 222; 
deflation of, 279; measurement of
variation in, 493 

Probable error, 152, 155; of index 
numbers, 206; see also Standard 
error 

Probability, 603; principles of, 426 ff.; 
addition of, 427; measurement of, 
429; a priori, 431; empirical, 431; 
normal, 439, 469, 471; normal 
table of, 699; integral, 436 
Probability curve, 98; see also Normal
law of error 
Probable value, 332 
Production, statistics of, 10, 35, 40,
43, 47; of fuel, 163, 265; of crops,
192; as measured by index numbers,
305 ff.; see also List of charts
Product-moment method, 349 ff., 
368; for classified data, 354 ff. 
Projection, of trend values, 277, 402 
Purposive selection, in sampling pro-
cedure, 462 

Quartile, 114; graphic location of, 
120 ff.; deviation, 150 ff., 154; 
standard error of, 473


Random fluctuations, 231; removal 
by moving averages, 241 
Random sampling, 458, 461; see also 
Sampling 

Range, of variation, 189, 154; semi- 
interquartile, 151 


Rank correlation, 374 ff.; see also 
Correlation, rank 

Rate, of interest, 30, 76, 228; of 
change, 40, 267, 278, 587; of ex- 
change, 94; averaging of, 125 
Ratio, chart, 29, 35 
Ratio, correlation, 413; see also Cor- 
relation ratio 

Reciprocals, use in measuring rela- 
tionship, 578 ff., 675 ff. 

Reed, Lowell, J., 272; logistic curve, 
675 

Reference cycles, 243, 262; correla- 
tion of, 382 
Regimen, 214, 322 

Regression, lines of, 359 ff.; use of, 
364 ff., 367, 423, ^7; coefficient of 
regression, 359 ff., 363, 479, 561, 
607; for cotton production and 
price, 387, 401; standard error of 
coefficient of regression, 479, 607, 
609 

Relationship, between income and 
auto registration, 326 ff., 352; meas-
urement of, 325 ff., 334; between
discount rates, 340 ff.; between
time series, 380 ff.; temporal, 391 ff.;
linear, see Linear relationship 
Relative deviations, 129; weighted, 
167 

Relative price, 162; arithmetic aver- 
age of, 183; geometric average, 
185, 198; harmonic average, 187; 
weighted average, 196 
Relative variation, measurement of, 
156 ff., 264 

Reliability, measures of, 464; of the 
mean, 464; of the difference be-
tween means, 481, 483; of the me-
dian, 472; of the standard deviation, 
473; of the coefficient of correlation, 
474; index of correlation, 477; coeffi- 
cient of regression, 478 
Residuals, 247 

Residual variability, see Variability, 
residual 

Retail price, index of, 220 

Richardson, A. H., 49 

Rietz, H. L., 143

Robertson, R. D., 405

Robinson, G., 88, 274, 465

Root-mean-square deviation, 146, 
276, 330, 416; see also Standard
deviation 
Rulon, P. J., 485 

Sample, size of, 117; estimates from, 
1^; in constructing index numbers,
206 

Sampling, problem of, 452 ff., 460; 
random, 458, 461; generalizing from
small samples, 598 ff.; errors of, 
293, 447; see also Standard error 
Sasuly, Max, 270 
Scale, for curve reading, 39
Scatter, 99; degree of, 137, 334, 409, 
414, 646; see also Variation 
Scatter diagram, 326, 328, 348, 370, 
416 

Scott, Frances V., 219 
Seasonal variation, 230, 284 ff., 380; 
removal by moving averages, 235; 
measurement of, 287 ff.; adjustment 
of, 317; test of significance of, 522 ff. 
Secular trend, 229, 380, 487; of cotton 
production and price, 383, 385; 
measurement of, 231 ff.; represen- 
tation by moving average, 234; 
by mathematical curves, 244 ff.,
667 ff.; of business series, 257 ff.;
selection of curve, 274 ff.

Selection of curve of trend, 274 
Semi-interquartile range, 151 
Semi-logarithmic charts, 28, 26<1;
advantages of, 40 

Series, periodic, 21; potential, 21; 
continuous, 75 

Sheppard, W. F., correction for 
grouping, 160, 442 ff.; table of 
normal areas, 436 

Shewhart, W. A., 49; distribution of
the standard deviation, 600 ff.
Significance, tests of, 464 ff.; see also 
Standard error 
Significant figures, 485 
Sine curve, 21 

Skewness, 96; measures of, 100, 122, 
137 ff., 157 ff., 449; of geometric 
series, 129; of the standard devia- 
tion, 600; of the correlation co-
efficient, 610

Slope, 293; of regression line, 336, 
350, 359, 361; see also Regression 
coefficient 


Smoothing of curves, 69 ff., 76, 117 
Snedecor, George W., 449, 688 
Snyder, Carl, 229 
Spurr, W. A., 293 

Squares of natural numbers, table of, 
706 

Standard deviation, 145 ff., 330, 
371, 416; characteristic features of,
155; use in adjusting index num-
bers, 311, 393, 395; in terms of
moments, 443; about the means
of arrays, 418; use of, in variance
analysis, 491, 494; see also Standard
error 

Standard error, of the binomial dis-
tribution, 434, 660; of the mean,
464, 664; of the difference of
means, 481, 483; of the median,
472; of the standard deviation, 473;
of the correlation coefficient, 474,
545; of the correlation index, 477;
of the regression coefficient, 478; of
the partial correlation coefficient,
560, 615; of the z function, 403, 
615; limitations of above measures, 
480 ff. 

Standard error of estimate, 330 ff.; 
computation of, 333, 338, 370, 388, 
401, 40(), 590; short-cut calculation, 
346, 354; of parabolic functions, 
410; significance of, 319, 371; about 
line of regression, 480; correction 
of, 413, 542; in multiple correla- 
tion analysis, 534, 541 ff.; of loga- 
rithmic functions, 571 ff.; in ratio 
terms, 573; in reciprocal terms, 
581; zones of estimate, TiiK) ff. 

Starr, G. W., 312 
Statistic, 457 

Statistical description, see Descrip- 
tion 

Statistical induction; see Induction 
Steinmetz, C. P., 256 
Stewart, Ethelbert, 84 
Stock price cycles, relation to busi- 
ness activity, 390, 397 
Straight line, fitting of, 246; see also 
Linear relationship 
Stratification, in sampling procedure, 
462 

Stratified purposive sampling, 463; 
standard error of, 472 
"Student," standard error of the rank 
correlation coefficient, 479; stand- 
ard error of the mean, 599; distribu- 
tion of the standard deviation, 
600 ff. 

Sturges, H. A., 57 
Symbols, glossary of, 691 ff. 
Symmetry, 100; degree of, 120; see 
also Skewness 

Table, of areas under the normal 
curve, 699; Fisher's t table, 603, 
700; of significant values of the cor- 
relation coefficient, 612, 701; of 
relations of the correlation coeffi- 
cient to the z function, 702; of the 
distribution of z, 704-5; of the 
powers of natural numbers, 706, 
708; of common logs, 709 
Tabulation of data, 51, 62; in cor- 
relation tables, 341, 354, 415 
Tendency, central; see Averages and 
Central tendency 
Thompson, F. L., 292 
Time reversal test, 100 
Time series, charts, 33, 48, 50; 
analysis of, 225 ff., 295; graphic 
representation, 227; removal of 
cycles, 231; fitting a line to, 252; 
measurement of seasonal fluctua- 
tion, 284 ff.; of cyclical fluctuation, 
284; measurement of relations be- 
tween, 380 ff.; see also Correlation 
of time series 
Tolley, H. H., 537, 652 
Trend, 262; of price movements, 170; 
of monthly values, 272; selection of 
curve of, 274 ff.; measurement of, 
225 ff.; secular, see Secular trend 


Ungrouped data, 109; product mo- 
ment method for, 352 
Uniformity of nature, principle of, 
457 

Unweighted index number, 184 
U. S. Bureau of Internal Revenue, 
320 

U. S. Bureau of Labor Statistics, 
statistics of fuel production, UW; 
index of wholesale prices, 168, 172, 
176, 212, 216 ff., 282; index number 
used, 193; index of retail prices, 
220; cost of living index, 221 
U. S. Department of Agriculture, in- 
dex of farm prices, 222 

Variability, measures of, 490 ff., 560; 
between classes, 494 ff.; absolute, 
586; residual, 526, 689; see also 
Variance and Variation 
Variable, 11; relations between vari- 
ables, 325 ff., 359, 360 
Variance, analysis of, 490 ff.; z test 
of difference in variability, 492, 
506, 513, 517; in testing variability 
between classes, 494; in the meas- 
urement of relationship, 501 ff., 
519; in testing linearity, 508 ff.; 
curvilinear hypothesis, 514 ff.; test- 
ing seasonal fluctuation, 522 ff.; 
in testing the multiple correlation 
coefficient, 545; in testing signifi- 
cance of principles of classification, 
681 ff. 

Variation, 97; measures of, 99, 137 ff., 
330; absolute, 138; comparison of 
measures of, 153, 155; measures of 
difference in, 490 ff.; coefficient of, 
156; in price relatives, 171 ff.; 
within and between arrays, 502; 
see also Seasonal and cyclical fluc- 
tuation 

Verhulst, P. F., 272 

Wage, statistics of, 96, 103, 105, 111, 
124 

Wahr, George, 437 

Walsh, C. M., 130, 201; ratio variabil- 
ity, 596 

Weighted average, 104, 106; of rela- 
tive prices, 106; geometric, 125; 
moving average, 244 
Weldon, W. F. R., dice experiment, 
432, 618 

Wheat, exports of, 33; yield corre- 
lated with fertilizer, 415 
Whipple, G. C., 87 
Whittaker, E. T., 88, 274, 465 
Wholesale price, 211 ff.; index of, 
216 ff.; see also Price 
Working, Holbrook, harmonic mean, 
588 
Yates, F., 688 
Yule, G. U., 60, 418; Chi-square fre- 
quencies, 622, 629 

z test of variability, 492, 506, 514, 
517; tables of, 704-5; standard error 
of, 615 
z transformation of correlation coeffi- 
cient, 613 ff., 702 
Zone, of estimate; see Estimate, Dis- 
persion 






